<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
    <title>BGs Labs</title>
    <subtitle>Burak Güngörs lab where he does some stuff</subtitle>
    <link rel="self" type="application/atom+xml" href="https://bgslabs.org/atom.xml"/>
    <link rel="alternate" type="text/html" href="https://bgslabs.org"/>
    <generator uri="https://www.getzola.org/">Zola</generator>
    <updated>2026-03-02T00:00:00+00:00</updated>
    <id>https://bgslabs.org/atom.xml</id>
    <entry xml:lang="en">
        <title>Why the heck are we still using Markdown??</title>
        <published>2026-03-02T00:00:00+00:00</published>
        <updated>2026-03-02T00:00:00+00:00</updated>
        
        <author>
          <name>
            
              Unknown
            
          </name>
        </author>
        
        <link rel="alternate" type="text/html" href="https://bgslabs.org/blog/why-are-we-using-markdown/"/>
        <id>https://bgslabs.org/blog/why-are-we-using-markdown/</id>
        
        <content type="html" xml:base="https://bgslabs.org/blog/why-are-we-using-markdown/">&lt;blockquote&gt;
&lt;p&gt;There are few things in life that
bring me much joy and hate at the same time.
Like chocolate that hurts when eaten and &lt;ins&gt;markdown&lt;&#x2F;ins&gt;.
Seriously why?? Half of the time we aren’t even
using the full language!&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;h2 id=&quot;html-is-the-best-programming-language&quot;&gt;HTML is the best Programming Language!&lt;&#x2F;h2&gt;
&lt;p&gt;I know you’ve heard people say the only &lt;em&gt;programming&lt;&#x2F;em&gt; language they know is HTML.
And I know, we both rolled our eyes in discontent trying to get our PL papers
out of our assembled decks of papers on how HTML is only a markup language
and not a programming language.&lt;&#x2F;p&gt;
&lt;p&gt;I mean, yes, we’re in the right, but that guy probably has something we don’t have.&lt;&#x2F;p&gt;
&lt;p&gt;A life.&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;[Note] When I’m talking about markdown, I am specifically talking about &lt;a href=&quot;https:&#x2F;&#x2F;commonmark.org&quot;&gt;CommonMark&lt;&#x2F;a&gt;
unless stated otherwise, because it is the &lt;em&gt;unambiguous&lt;&#x2F;em&gt; syntax specification.
I love the project, I really appreciate their efforts on making this language a
bit more grounded. It’s not the specification that’s broken,
it’s the language itself.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;h2 id=&quot;the-good&quot;&gt;The Good&lt;&#x2F;h2&gt;
&lt;p&gt;Markdown is a minimal language used for typesetting trivial documents.
It needs to do one simple job: get a Markdown file and output an HTML file.
Its syntax is as legible as it gets and is easy to write even with no assistance.
Like the C language, you can see the output that will be created. Bold always ends up as
&lt;code&gt;&amp;lt;strong&amp;gt;&amp;lt;&#x2F;strong&amp;gt;&lt;&#x2F;code&gt; and italic as &lt;code&gt;&amp;lt;em&amp;gt;&amp;lt;&#x2F;em&amp;gt;&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;The learning curve is simply nonexistent if you’re just a casual user. Just one look
at the cheat sheet and you’re ready.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-bad&quot;&gt;The Bad&lt;&#x2F;h2&gt;
&lt;p&gt;We don’t know what we want.&lt;&#x2F;p&gt;
&lt;p&gt;Do we want UI? Do we want a programming language? We don’t know.
The only reason feature creep exists is because of unclear specifications.&lt;&#x2F;p&gt;
&lt;p&gt;You want a &lt;strong&gt;MINIMAL&lt;&#x2F;strong&gt; easily legible &lt;strong&gt;markup&lt;&#x2F;strong&gt; language, you have markdown. Simple as that right?&lt;&#x2F;p&gt;
&lt;p&gt;well…&lt;&#x2F;p&gt;
&lt;p&gt;(output taken from &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;spec.commonmark.org&#x2F;dingus&#x2F;&quot;&gt;dingus&lt;&#x2F;a&gt;)&lt;&#x2F;p&gt;
&lt;p&gt;&lt;div class=&quot;sidebyside sidebyside-left-right&quot;&gt;
    &lt;div class=&quot;sidebyside-first&quot;&gt;
        &lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;# Hello&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;*I am an*&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;__Unambiguous__&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&amp;gt; Grammar&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
    &lt;&#x2F;div&gt;
    &lt;div class=&quot;sidebyside-second&quot;&gt;
        &lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;html&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-tag&quot;&gt;h1&lt;&#x2F;span&gt;&lt;span&gt;&amp;gt;&lt;&#x2F;span&gt;&lt;span&gt;Hello&lt;&#x2F;span&gt;&lt;span&gt;&amp;lt;&#x2F;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-tag&quot;&gt;h1&lt;&#x2F;span&gt;&lt;span&gt;&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-tag&quot;&gt;p&lt;&#x2F;span&gt;&lt;span&gt;&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &amp;lt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-tag&quot;&gt;em&lt;&#x2F;span&gt;&lt;span&gt;&amp;gt;&lt;&#x2F;span&gt;&lt;span&gt;I am an&lt;&#x2F;span&gt;&lt;span&gt;&amp;lt;&#x2F;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-tag&quot;&gt;em&lt;&#x2F;span&gt;&lt;span&gt;&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &amp;lt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-tag&quot;&gt;strong&lt;&#x2F;span&gt;&lt;span&gt;&amp;gt;&lt;&#x2F;span&gt;&lt;span&gt;Unambiguous&lt;&#x2F;span&gt;&lt;span&gt;&amp;lt;&#x2F;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-tag&quot;&gt;strong&lt;&#x2F;span&gt;&lt;span&gt;&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&amp;lt;&#x2F;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-tag&quot;&gt;p&lt;&#x2F;span&gt;&lt;span&gt;&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-tag&quot;&gt;blockquote&lt;&#x2F;span&gt;&lt;span&gt;&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &amp;lt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-tag&quot;&gt;p&lt;&#x2F;span&gt;&lt;span&gt;&amp;gt;&lt;&#x2F;span&gt;&lt;span&gt;Grammar&lt;&#x2F;span&gt;&lt;span&gt;&amp;lt;&#x2F;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-tag&quot;&gt;p&lt;&#x2F;span&gt;&lt;span&gt;&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&amp;lt;&#x2F;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-tag&quot;&gt;blockquote&lt;&#x2F;span&gt;&lt;span&gt;&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
    &lt;&#x2F;div&gt;
&lt;&#x2F;div&gt;

&lt;div class=&quot;sidebyside sidebyside-left-right&quot;&gt;
    &lt;div class=&quot;sidebyside-first&quot;&gt;
        &lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Hello&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;=====&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;_I am an_&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;**Unambiguous**&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&amp;gt; Grammar&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
    &lt;&#x2F;div&gt;
    &lt;div class=&quot;sidebyside-second&quot;&gt;
        &lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;html&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-tag&quot;&gt;h1&lt;&#x2F;span&gt;&lt;span&gt;&amp;gt;&lt;&#x2F;span&gt;&lt;span&gt;Hello&lt;&#x2F;span&gt;&lt;span&gt;&amp;lt;&#x2F;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-tag&quot;&gt;h1&lt;&#x2F;span&gt;&lt;span&gt;&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-tag&quot;&gt;p&lt;&#x2F;span&gt;&lt;span&gt;&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &amp;lt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-tag&quot;&gt;em&lt;&#x2F;span&gt;&lt;span&gt;&amp;gt;&lt;&#x2F;span&gt;&lt;span&gt;I am an&lt;&#x2F;span&gt;&lt;span&gt;&amp;lt;&#x2F;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-tag&quot;&gt;em&lt;&#x2F;span&gt;&lt;span&gt;&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &amp;lt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-tag&quot;&gt;strong&lt;&#x2F;span&gt;&lt;span&gt;&amp;gt;&lt;&#x2F;span&gt;&lt;span&gt;Unambiguous&lt;&#x2F;span&gt;&lt;span&gt;&amp;lt;&#x2F;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-tag&quot;&gt;strong&lt;&#x2F;span&gt;&lt;span&gt;&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&amp;lt;&#x2F;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-tag&quot;&gt;p&lt;&#x2F;span&gt;&lt;span&gt;&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-tag&quot;&gt;blockquote&lt;&#x2F;span&gt;&lt;span&gt;&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &amp;lt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-tag&quot;&gt;p&lt;&#x2F;span&gt;&lt;span&gt;&amp;gt;&lt;&#x2F;span&gt;&lt;span&gt;Grammar&lt;&#x2F;span&gt;&lt;span&gt;&amp;lt;&#x2F;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-tag&quot;&gt;p&lt;&#x2F;span&gt;&lt;span&gt;&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&amp;lt;&#x2F;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-tag&quot;&gt;blockquote&lt;&#x2F;span&gt;&lt;span&gt;&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
    &lt;&#x2F;div&gt;
&lt;&#x2F;div&gt;
&lt;&#x2F;p&gt;
&lt;p&gt;I hope two eyeballs are enough to see that markdown is NOT what you asked for.
These two inputs produce IDENTICAL output. And this is just the tip of the iceberg.&lt;&#x2F;p&gt;
&lt;p&gt;It has so many poor decisions baked in that, if you try to use it, it will actively
fight against you the moment you think you know what you’re doing.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;exhibit-a-bold-italic-bold-italic&quot;&gt;Exhibit A: bold, italic, bold-italic, ???&lt;&#x2F;h3&gt;
&lt;p&gt;In markdown you can write bold text in several different ways.
&lt;code&gt;**bold**&lt;&#x2F;code&gt;, &lt;code&gt;__bold__&lt;&#x2F;code&gt;, &lt;code&gt;&amp;lt;b&amp;gt;bold&amp;lt;&#x2F;b&amp;gt;&lt;&#x2F;code&gt; are &lt;em&gt;some&lt;&#x2F;em&gt;
of the ways a valid bold can be written. And these are just the CommonMark ones.
If you’re using something which isn’t marketing itself as “CommonMark™ Compliant®©”,
you can very well encounter &lt;em&gt;valid&lt;&#x2F;em&gt; stuff that produces the same output. Like:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;_*bold*_&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;*_bold*_&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;_*bold_*&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;*_bold_*&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Truly magnificent.&lt;&#x2F;p&gt;
&lt;p&gt;And please don’t let me get started on layered ones like:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;***Peter* Piper** _Picked___a___Pack_ *of** Pick_led_* Peppers&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;or this:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;*****\\*a*&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This thing is actually so peak that we have a class of parser vulnerabilities called
&lt;strong&gt;ReDoS&lt;&#x2F;strong&gt; (Regular Expression Denial of Service) affecting it. Like &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;security.snyk.io&#x2F;vuln&#x2F;SNYK-JS-MARKDOWNIT-10666750&quot;&gt;this&lt;&#x2F;a&gt; 6.9 (nice) severity level CVE for &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;markdown-it&#x2F;markdown-it&quot;&gt;markdown-it&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;“markdown-it” is one of the most worked out, clean and easy to understand libraries for Markdown.
I simply love markdown-it. The fact that even this library is affected by it shows the
absolute state of how bad the situation is.&lt;&#x2F;p&gt;
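&lt;p&gt;For a sense of what a parser has to do with input like the two snippets above, here is a minimal Python sketch of the first step a hand-written emphasis parser can take: splitting the text into delimiter runs (a term the CommonMark spec defines) in a single pass. The name &lt;code&gt;delimiter_runs&lt;&#x2F;code&gt; and the tuple layout are hypothetical, this is not how markdown-it does it, and pairing the runs into actual emphasis (the hard part) is not shown.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;python&quot;&gt;
```python
def delimiter_runs(text, delimiters="*_"):
    """Return (char, length, start) for every run of emphasis delimiters.

    Hypothetical helper: one left-to-right pass, no regex, no backtracking.
    """
    runs = []
    index = 0
    while index != len(text):
        char = text[index]
        if char == "\\":
            # Skip the backslash and whatever character follows it.
            index = min(index + 2, len(text))
            continue
        if char in delimiters:
            start = index
            # Consume the whole run of the same delimiter character.
            while index != len(text) and text[index] == char:
                index += 1
            runs.append((char, index - start, start))
            continue
        index += 1
    return runs
```
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Every character is visited once, so this step stays linear no matter how many asterisks you throw at it. The trouble starts afterwards, when the runs have to be matched against each other.&lt;&#x2F;p&gt;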
&lt;h3 id=&quot;exhibit-b-asm-was-a-good-idea-but-this&quot;&gt;Exhibit B: __asm__ Was a Good Idea, but This?&lt;&#x2F;h3&gt;
&lt;p&gt;In older languages, a compiler producing optimal code was like a river in a desert.
Inline assembly let programmers write performance-critical code with ease, at the cost of
the compiler engineers’ blood, sweat, tears, and the birth of their firstborn son.&lt;&#x2F;p&gt;
&lt;p&gt;It allowed stuff like SIMD operations before the compiler added support for them.
If you want an overview of early SIMD generation failures, you can take a look &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;bgslabs.org&#x2F;blog&#x2F;evolution-of-x86-simd&#x2F;#appendix-b-assembly-code-analysis&quot;&gt;here&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Now let’s take this wonderful idea and bolt it directly into the most bloated,
single threaded, sandboxed environment expecting a simple and easy way to write
documents. And this is how inline HTML inside Markdown was born!&lt;&#x2F;p&gt;
&lt;p&gt;Inline HTML allows you to do stuff like this:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;markdown&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-markup z-heading&quot;&gt;#&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name&quot;&gt; Hi&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;I am a &lt;&#x2F;span&gt;&lt;span&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-tag&quot;&gt;ins&lt;&#x2F;span&gt;&lt;span&gt;&amp;gt;&lt;&#x2F;span&gt;&lt;span&gt; simple &lt;&#x2F;span&gt;&lt;span&gt;&amp;lt;&#x2F;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-tag&quot;&gt;ins&lt;&#x2F;span&gt;&lt;span&gt;&amp;gt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-markup z-italic&quot;&gt; _&lt;&#x2F;span&gt;&lt;span class=&quot;z-markup z-italic&quot;&gt;programmer&lt;&#x2F;span&gt;&lt;span class=&quot;z-markup z-italic&quot;&gt;_&lt;&#x2F;span&gt;&lt;span&gt; doing&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-tag&quot;&gt;span&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity&quot;&gt; class&lt;&#x2F;span&gt;&lt;span&gt;=&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-definition z-string&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;span class=&quot;z-string&quot;&gt;fancy-text&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-definition z-string&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;&amp;gt;&lt;&#x2F;span&gt;&lt;span&gt; elegant &lt;&#x2F;span&gt;&lt;span&gt;&amp;lt;&#x2F;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-tag&quot;&gt;span&lt;&#x2F;span&gt;&lt;span&gt;&amp;gt;&lt;&#x2F;span&gt;&lt;span&gt; programming.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-tag&quot;&gt;div&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity&quot;&gt; class&lt;&#x2F;span&gt;&lt;span&gt;=&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-definition z-string&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;span class=&quot;z-string&quot;&gt;animation&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-definition z-string&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;And here is my portfolio&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&amp;lt;&#x2F;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-tag&quot;&gt;div&lt;&#x2F;span&gt;&lt;span&gt;&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Isn’t this just simple! Isn’t this just neat! The main reason why &lt;strong&gt;correct&lt;&#x2F;strong&gt; markdown parsing is exceptionally hard isn’t
that Markdown syntax is so hard to comprehend. That’s only 1&#x2F;10th of the issue. The real issue is that to ship a
Markdown parser you also need to ship a friendly HTML parser. And if you’re using HTML inside Markdown,
why not use HTML from the start!&lt;&#x2F;p&gt;
&lt;p&gt;Said the person writing this using all the bells and whistles known to man, which are NOT in the standard.&lt;&#x2F;p&gt;
&lt;p&gt;Markdown in and of itself isn’t powerful enough to satisfy the simple monkey brained developer like me who is only satisfied
when the site looks good &lt;em&gt;enough™&lt;&#x2F;em&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Good &lt;em&gt;enough™&lt;&#x2F;em&gt; in this case means it needs to have at least basic $\LaTeX$ with TikZ support with the ability to install packages,
PlantUML, Mermaid, custom styling, custom shortcodes, tagging and taxonomy, proper footnotes, BibTeX support…&lt;&#x2F;p&gt;
&lt;p&gt;I don’t want just a simple job from this simple tool either. To nail a painting to the
wall I need a hammer. In this case the hammer is markdown. But if I want
to paint it too, I will break the canvas the moment I try to paint with the hammer.&lt;&#x2F;p&gt;
&lt;p&gt;Breaking the canvas also means a whole lot of CVEs, primarily around XSS vulnerabilities.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;details class=&quot;drawer&quot;&gt;
    &lt;summary class=&quot;drawer-header&quot;&gt;
        &lt;span class=&quot;drawer-title&quot;&gt;Inline HTML Related CVEs&lt;&#x2F;span&gt;
        &lt;span class=&quot;drawer-icon&quot;&gt;▼&lt;&#x2F;span&gt;
    &lt;&#x2F;summary&gt;
    &lt;div class=&quot;drawer-content&quot;&gt;
        &lt;ul&gt;
&lt;li&gt;CVE-2025-24981 (XSS vulnerability)&lt;&#x2F;li&gt;
&lt;li&gt;CVE-2025-46734 (XSS vulnerability)&lt;&#x2F;li&gt;
&lt;li&gt;CVE-2025-7969 (XSS vulnerability)&lt;&#x2F;li&gt;
&lt;li&gt;CVE-2025-60312 (XSS vulnerability)&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;

    &lt;&#x2F;div&gt;
&lt;&#x2F;details&gt;

Every time we allow inline HTML, plugin hooks, or embedded execution engines, we expand the attack surface.&lt;&#x2F;p&gt;
&lt;p&gt;The result is predictable: recurring XSS vulnerabilities across major Markdown implementations.
And much, much more. And this will continue to rise with the vibe sh*tted parsers hitting the market.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;exhibit-c-obscure-and-old-syntax&quot;&gt;Exhibit C: Obscure and Old Syntax&lt;&#x2F;h3&gt;
&lt;p&gt;Markdown, like all the good tech we have around the World Wide Web, was made in the good old 00’s!
Markdown was inspired by preexisting conventions for marking up &lt;em&gt;plaintext&lt;&#x2F;em&gt; in email and Usenet posts (&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;arstechnica.com&#x2F;information-technology&#x2F;2014&#x2F;10&#x2F;markdown-throwdown-what-happens-when-foss-software-gets-corporate-backing&#x2F;&quot;&gt;ref&lt;&#x2F;a&gt;).&lt;&#x2F;p&gt;
&lt;p&gt;Mind you, before the 2000s most serious emails looked like this:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;From: k...@rational.com (Kent Mitchell)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Subject: Re: Does memory leak?&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Date: 1995&#x2F;03&#x2F;31&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Norman H. Cohen (nco...@watson.ibm.com) wrote:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;: The only programs I know of with deliberate memory leaks are those whose&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;: executions are short enough, and whose target machines have enough&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;: virtual memory space, that running out of memory is not a concern.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;: (This class of programs includes many student programming exercises and&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;: some simple applets and utilities; it includes few if any embedded or&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;: safety-critical programs.)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;This sparked an interesting memory for me.  I was once working with a&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;customer who was producing on-board software for a missile.  In my analysis&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;of the code, I pointed out that they had a number of problems with storage&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;leaks.  Imagine my surprise when the customers chief software engineer said&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&amp;quot;Of course it leaks&amp;quot;.  He went on to point out that they had calculated the&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;amount of memory the application would leak in the total possible flight time&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;for the missile and then doubled that number.  They added this much&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;additional memory to the hardware to &amp;quot;support&amp;quot; the leaks.  Since the missile&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;will explode when it hits its target or at the end of its flight, the&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;ultimate in garbage collection is performed without programmer intervention.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;--&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Kent Mitchell                   | One possible reason that things aren&amp;#39;t&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Technical Consultant            | going according to plan is .....&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Rational Software Corporation   | that there never *was* a plan!&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Do you see the beauty of it? The quote syntax, the vertical separation with pipe (&lt;code&gt;|&lt;&#x2F;code&gt;) characters,
the 80-column line length limit. This was what markdown was &lt;em&gt;needed&lt;&#x2F;em&gt; for (which it would fail horribly at,
because in markdown you can’t divide the screen into N parts without inline HTML black magic).&lt;&#x2F;p&gt;
&lt;p&gt;But because of this legacy, you have 2 different ways of writing headings, the ATX way (&lt;code&gt;#&lt;&#x2F;code&gt; prefixes) and the setext way (underlines).
You have 2 different ways and more than 2 CVEs of writing bold and italic. You have several ways
of writing a horizontal rule, one of which collides with the setext heading syntax. You have
at least 2 different ways of writing unordered lists. You have ordered lists which don’t care about how
&lt;em&gt;you&lt;&#x2F;em&gt; numbered them. And you have a footnote syntax which moves this entire grammar to a context-sensitive
grammar.&lt;&#x2F;p&gt;
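&lt;p&gt;To make the heading duplication and the &lt;code&gt;---&lt;&#x2F;code&gt; collision concrete, here is a rough Python sketch. Everything in it is an assumption for illustration: the function name &lt;code&gt;headings&lt;&#x2F;code&gt; is hypothetical and the two patterns are heavily simplified, nowhere near the full CommonMark rules.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;python&quot;&gt;
```python
import re

# Simplified patterns, not the full CommonMark rules
# (no indentation handling, no closing # sequences).
ATX = re.compile(r"^(#{1,6})\s+(.*?)\s*$")
SETEXT_UNDERLINE = re.compile(r"^(=+|-+)\s*$")


def headings(source):
    """Return (level, text) for each heading, whichever syntax wrote it."""
    lines = source.splitlines()
    found = []
    for index, line in enumerate(lines):
        atx = ATX.match(line)
        if atx:
            found.append((len(atx.group(1)), atx.group(2)))
            continue
        underline = SETEXT_UNDERLINE.match(line)
        previous = lines[index - 1].strip() if index else ""
        # The same "---" is a heading underline or a horizontal rule,
        # depending only on whether text sits on the line above it.
        if underline and previous:
            level = 1 if underline.group(1).startswith("=") else 2
            found.append((level, previous))
    return found
```
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Two code paths, one result. And the second path cannot decide what a line of dashes means without looking at its neighbour.&lt;&#x2F;p&gt;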
&lt;p&gt;This language is literally the C++ of markup languages. Nearly everything can be done in 2 different ways,
some of which might allow XSS and somehow leak memory in HTML.&lt;&#x2F;p&gt;
&lt;p&gt;It’s used everywhere but it also fails everywhere, equally.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;exhibit-d-you-can-t-simply-just-parse-it-you-also-have-to-use-it&quot;&gt;Exhibit D: You Can’t Simply Just Parse It, You Also Have to Use It.&lt;&#x2F;h3&gt;
&lt;h4 id=&quot;parsing-is-the-easy-thing&quot;&gt;Parsing Is the Easy Thing&lt;&#x2F;h4&gt;
&lt;p&gt;If you remember, I told you that the footnote syntax moves this grammar up to a context-sensitive one.
Let me elaborate on that.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;div class=&quot;sidebyside sidebyside-left-right&quot;&gt;
    &lt;div class=&quot;sidebyside-first&quot;&gt;
        &lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;markdown&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;test &lt;&#x2F;span&gt;&lt;span&gt;[&lt;&#x2F;span&gt;&lt;span class=&quot;z-string z-other z-link&quot;&gt;^1&lt;&#x2F;span&gt;&lt;span&gt;]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
    &lt;&#x2F;div&gt;
    &lt;div class=&quot;sidebyside-second&quot;&gt;
        &lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;html&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-tag&quot;&gt;p&lt;&#x2F;span&gt;&lt;span&gt;&amp;gt;&lt;&#x2F;span&gt;&lt;span&gt;test [^1]&lt;&#x2F;span&gt;&lt;span&gt;&amp;lt;&#x2F;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-tag&quot;&gt;p&lt;&#x2F;span&gt;&lt;span&gt;&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
    &lt;&#x2F;div&gt;
&lt;&#x2F;div&gt;

&lt;div class=&quot;sidebyside sidebyside-left-right&quot;&gt;
    &lt;div class=&quot;sidebyside-first&quot;&gt;
        &lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;test [^1]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;[^1]: hello&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
    &lt;&#x2F;div&gt;
    &lt;div class=&quot;sidebyside-second&quot;&gt;
        &lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&amp;lt;p&amp;gt;test &amp;lt;a href=″hello″&amp;gt;^1&amp;lt;&#x2F;a&amp;gt;&amp;lt;&#x2F;p&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;*formatted &amp;amp; I had to change &quot; to ″ because technical reasons&lt;&#x2F;p&gt;

    &lt;&#x2F;div&gt;
&lt;&#x2F;div&gt;
&lt;&#x2F;p&gt;
&lt;p&gt;Actually, footnotes are not supported in CommonMark; it has &lt;em&gt;link reference definitions&lt;&#x2F;em&gt; instead.
The only notable difference is that the link syntax lets you put only a destination (and an optional title) after
the definition, as opposed to a shameful explanation of why bananas on pizza is a good idea.&lt;&#x2F;p&gt;
&lt;p&gt;Reference-style links and footnotes require global definition resolution.
A token’s meaning depends on declarations elsewhere in the document.
That breaks purely context-free parsing assumptions. Ergo the update
to a CSG from a CFG.&lt;&#x2F;p&gt;
&lt;p&gt;Which is a lot of words for saying:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;If you want a simple language, stay simple.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
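&lt;p&gt;The global definition resolution described above can be sketched in Python as two passes. Treat it as an illustration under loud assumptions: &lt;code&gt;classify_references&lt;&#x2F;code&gt; is a hypothetical name, the patterns only handle one-line definitions, and labels are compared verbatim instead of being normalized as the spec requires.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;python&quot;&gt;
```python
import re

# Hypothetical, simplified patterns; real CommonMark rules are far stricter.
DEFINITION = re.compile(r"^\[([^\]]+)\]:\s*(\S+)\s*$")
REFERENCE = re.compile(r"\[([^\]]+)\]")


def classify_references(source):
    """Label every [bracketed] span as a link or as plain text."""
    definitions = {}
    body = []
    # Pass 1: definitions may appear anywhere, so read the whole document.
    for line in source.splitlines():
        match = DEFINITION.match(line)
        if match:
            # The first definition of a label wins, as in CommonMark.
            definitions.setdefault(match.group(1), match.group(2))
        else:
            body.append(line)
    # Pass 2: only now can each reference be given a meaning.
    tokens = []
    for label in REFERENCE.findall("\n".join(body)):
        if label in definitions:
            tokens.append(("link", label, definitions[label]))
        else:
            tokens.append(("text", "[" + label + "]"))
    return tokens
```
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The same &lt;code&gt;[^1]&lt;&#x2F;code&gt; comes back as text or as a link depending on a line that may sit at the very end of the file, which is exactly why a single streaming pass is not enough.&lt;&#x2F;p&gt;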
&lt;h4 id=&quot;rendering-is-the-hard-thing&quot;&gt;Rendering Is the Hard Thing&lt;&#x2F;h4&gt;
&lt;p&gt;I think this is the most controversial part about this rant. Because nobody wants to
admit that what we use and need are two very different things.&lt;&#x2F;p&gt;
&lt;p&gt;Vanilla Markdown needs a &lt;strong&gt;transliterator&lt;&#x2F;strong&gt;, which is literally a 1:1 mapping function.
You see &lt;code&gt;**bold**&lt;&#x2F;code&gt;, &lt;code&gt;&amp;lt;strong&amp;gt;bold&amp;lt;&#x2F;strong&amp;gt;&lt;&#x2F;code&gt; comes out.&lt;&#x2F;p&gt;
&lt;p&gt;But modern Markdown has to support stuff like footnotes, which turns this &lt;em&gt;simple&lt;&#x2F;em&gt;
transliterator into a full-blown compiler. And after you’ve taken this step, here’s
the ladder you’ll climb.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;I need a personal knowledge management system&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;requirement&lt;&#x2F;th&gt;&lt;th&gt;technicality&lt;&#x2F;th&gt;&lt;th&gt;result&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;I’ll need footnotes&lt;&#x2F;td&gt;&lt;td&gt;CFG to CSG grammar&lt;&#x2F;td&gt;&lt;td&gt;basic compiler&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;I’ll need custom callouts*&lt;&#x2F;td&gt;&lt;td&gt;hooks that bind HTML with md&lt;&#x2F;td&gt;&lt;td&gt;Dependency graphs&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;I’ll need math&lt;&#x2F;td&gt;&lt;td&gt;Need typesetting library integration&lt;&#x2F;td&gt;&lt;td&gt;More complex dependency graphs with execution engines needed&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;I’ll need custom styling&lt;&#x2F;td&gt;&lt;td&gt;Custom CSS&lt;&#x2F;td&gt;&lt;td&gt;CSS injection with file based scoping rules&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;(* like tera templating &#x2F; shortcodes)&lt;&#x2F;p&gt;
&lt;p&gt;I swear, what I have to do to make Obsidian work for my needs using &lt;em&gt;Markdown&lt;&#x2F;em&gt; makes it seem like Notion made a sane
decision copying Word &#x2F; Excel internally.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;solution&quot;&gt;Solution?&lt;&#x2F;h2&gt;
&lt;p&gt;If you ask the guy with a majestic beard with 30+ years of experience in 8086 assembly, the answer is plain text.&lt;&#x2F;p&gt;
&lt;p&gt;If you ask the guy with a MacBook and openclaw running on 3 different Mac minis making a startup a day, the answer is MDX.&lt;&#x2F;p&gt;
&lt;p&gt;If you ask the guy who codes in C for a living, the answer is reStructuredText (.rst).&lt;&#x2F;p&gt;
&lt;p&gt;If you ask me, the answer is none of them. All are broken in their own ways. Plain text is beautiful,
but I can’t show it to somebody that doesn’t know what a null pointer dereference is.
reStructuredText is wonderful if you only read it and never write it. And MDX is so busy trying to be HTML,
it forgets it needs to be legible.&lt;&#x2F;p&gt;
&lt;p&gt;And the biggest problem in all of them is that, they don’t have a build system. We cut corners in the name of speed.
If they had a build system, a sane unambiguous legible syntax purpose built for what it’s needed to be.
I think all of the problems will be fixed.&lt;&#x2F;p&gt;
&lt;p&gt;Don’t allow inline HTML; allow well-defined shortcodes and functions to manipulate the text.&lt;&#x2F;p&gt;
&lt;p&gt;Allow for custom hooks to be executed before, during and after the compilation.&lt;&#x2F;p&gt;
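&lt;p&gt;To make that a bit more concrete, here is a minimal sketch of what “defined shortcodes plus hooks” could look like. Everything in it is made up for illustration: the &lt;code&gt;{{ name(argument) }}&lt;&#x2F;code&gt; syntax, the &lt;code&gt;SHORTCODES&lt;&#x2F;code&gt; table and the &lt;code&gt;build&lt;&#x2F;code&gt; function are assumptions of mine, not an existing tool. Shortcode expansion plays the “during” role here; the hooks run before and after it.&lt;&#x2F;p&gt;

```python
import re

# Hypothetical shortcode syntax for this sketch: {{ name(argument) }}
SHORTCODE = re.compile(r"\{\{\s*(\w+)\((.*?)\)\s*\}\}")

# Only registered shortcodes exist; there is no inline-HTML escape hatch.
SHORTCODES = {
    "upper": lambda arg: arg.upper(),
    "note": lambda arg: "[NOTE] " + arg,
}


def expand(match):
    """Replace one shortcode call with its output, or fail loudly."""
    name, arg = match.group(1), match.group(2)
    if name not in SHORTCODES:
        raise ValueError("undefined shortcode: " + name)
    return SHORTCODES[name](arg)


def build(source, before=(), after=()):
    """Run the pre-hooks, expand shortcodes, then run the post-hooks."""
    for hook in before:
        source = hook(source)
    source = SHORTCODE.sub(expand, source)
    for hook in after:
        source = hook(source)
    return source
```

&lt;p&gt;The point of the sketch is the failure mode: an unknown shortcode is a build error instead of a silent pass-through, which is the “define everything” part.&lt;&#x2F;p&gt;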
&lt;p&gt;And most importantly define everything. What we need is a custom built tool, not a Frankenstein’s
monster that we dare to call a language.&lt;&#x2F;p&gt;
&lt;p&gt;I’m not saying Markdown &#x2F; Markup should be a programming language;
I’m saying we are trying to use it like one, and because it lacks a formal foundation, it’s failing at both.&lt;&#x2F;p&gt;
&lt;p&gt;I genuinely think a proper build system around a saner markup language with compile time hook support can fix a lot
of the problems around this with the right constraints. At this point we should just let go of Markdown for good and
look for our answers elsewhere. Preferably with a trivially parsable grammar.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;what-do-i-mean-by-all-this&quot;&gt;What Do I Mean by All This&lt;&#x2F;h2&gt;
&lt;blockquote&gt;
&lt;p&gt;You can skip this if you understood what I’ve talked about.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;From a formal language theory perspective:&lt;&#x2F;p&gt;
&lt;dl&gt;
&lt;dt&gt;&lt;strong&gt;Markup Language&lt;&#x2F;strong&gt; (&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.britannica.com&#x2F;technology&#x2F;markup-language&quot;&gt;Encyclopedia Britannica&lt;&#x2F;a&gt;)&lt;&#x2F;dt&gt;
&lt;dd&gt;
&lt;p&gt;Markup language, standard text-encoding system consisting of a set of symbols
inserted in a text document to control its structure, formatting, or the
relationship between its parts…&lt;&#x2F;p&gt;
&lt;&#x2F;dd&gt;
&lt;dt&gt;&lt;strong&gt;Programming Language&lt;&#x2F;strong&gt; (&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.britannica.com&#x2F;technology&#x2F;computer-programming-language&quot;&gt;Encyclopedia Britannica&lt;&#x2F;a&gt;)&lt;&#x2F;dt&gt;
&lt;dd&gt;
&lt;p&gt;Computer programming language, any of various languages for expressing a set
of detailed instructions for a digital computer…&lt;&#x2F;p&gt;
&lt;&#x2F;dd&gt;
&lt;&#x2F;dl&gt;
&lt;p&gt;As you can CLEARLY see, Britannica is not the way to go. So I’ll use the other thing that I always trust in this life,
which is common sense, and define a programming language as:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;A &lt;strong&gt;Language&lt;&#x2F;strong&gt; is a &lt;strong&gt;Programming Language&lt;&#x2F;strong&gt; if it is &lt;strong&gt;Turing Complete&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;For the definition of a &lt;em&gt;language&lt;&#x2F;em&gt; I’m taking the only source of truth which is “Chomsky’s framework” from 1956.&lt;&#x2F;p&gt;
&lt;details class=&quot;drawer&quot;&gt;
    &lt;summary class=&quot;drawer-header&quot;&gt;
        &lt;span class=&quot;drawer-title&quot;&gt;basically&lt;&#x2F;span&gt;
        &lt;span class=&quot;drawer-icon&quot;&gt;▼&lt;&#x2F;span&gt;
    &lt;&#x2F;summary&gt;
    &lt;div class=&quot;drawer-content&quot;&gt;
        &lt;p&gt;&lt;strong&gt;Language&lt;&#x2F;strong&gt;: A set of strings (finite sequences of symbols) over a finite alphabet.
$$L \subseteq \Sigma^* $$&lt;&#x2F;p&gt;
&lt;p&gt;To recap, an alphabet ($\Sigma$) is basically a set with all the letters that you can imagine.
Or realistically, anything that you can put into Unicode without breaking your preferred OS’s
parser.&lt;&#x2F;p&gt;
&lt;p&gt;$$
\begin{align}
\Sigma &amp;amp;= \lbrace a,b \rbrace  \newline
\Sigma_{Unicode} &amp;amp;= \lbrace char | char \in \text{Unicode Standard} \rbrace
\end{align}
$$&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;is a minimal alphabet that I made up.&lt;&#x2F;li&gt;
&lt;li&gt;is the alphabet that we use from now on when I mean an alphabet.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;And a language is any $L$ where $L$ is a set of strings made from the symbols of $\Sigma$, so&lt;&#x2F;p&gt;
&lt;p&gt;$$ L = \lbrace klasfdushsda,ksadhf, \dots \rbrace $$&lt;&#x2F;p&gt;
&lt;p&gt;is a valid language. Go figure.&lt;&#x2F;p&gt;

    &lt;&#x2F;div&gt;
&lt;&#x2F;details&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Type&lt;&#x2F;th&gt;&lt;th&gt;Name&lt;&#x2F;th&gt;&lt;th&gt;Formal Definition&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;td&gt;Recursively Enumerable&lt;&#x2F;td&gt;&lt;td&gt;Recognized by &lt;strong&gt;Turing machines&lt;&#x2F;strong&gt;&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;td&gt;Context-Sensitive&lt;&#x2F;td&gt;&lt;td&gt;Recognized by &lt;strong&gt;linear-bounded automata&lt;&#x2F;strong&gt;&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;2&lt;&#x2F;td&gt;&lt;td&gt;Context-Free&lt;&#x2F;td&gt;&lt;td&gt;Recognized by &lt;strong&gt;pushdown automata&lt;&#x2F;strong&gt; (e.g., programming language syntax)&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;3&lt;&#x2F;td&gt;&lt;td&gt;Regular&lt;&#x2F;td&gt;&lt;td&gt;Recognized by &lt;strong&gt;finite automata&lt;&#x2F;strong&gt; (e.g., simple patterns, regex)&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;In the food pyramid of languages, Context-Free languages are our
wheat and regular languages are our protein. A regular language is what your basic regex can parse.
A Context-Free language is what &lt;em&gt;BNF&#x2F;EBNF&lt;&#x2F;em&gt; can describe and context-sensitive is what we
call C++ syntax. I hope you don’t have to deal with the latter.&lt;&#x2F;p&gt;
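&lt;p&gt;The jump from Type 3 to Type 2 can be sketched in a few lines of Python. The two example languages (&lt;code&gt;a*b*&lt;&#x2F;code&gt; and balanced parentheses) and the function names are my own picks for illustration. The first fits in a regex; the second needs something that can count, which is the job of the stack in a pushdown automaton (a plain depth counter is enough for a single bracket type).&lt;&#x2F;p&gt;

```python
import re

# Type 3 (regular): zero or more a's followed by zero or more b's.
REGULAR = re.compile(r"a*b*")


def in_regular_language(s):
    """True if s is in the regular language a*b*."""
    return REGULAR.fullmatch(s) is not None


def in_context_free_language(s):
    """True if s is a string of balanced parentheses.

    The depth counter stands in for the stack of a pushdown automaton.
    """
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            if depth == 0:
                return False  # a closer with nothing left to close
            depth -= 1
        else:
            return False  # symbol outside the alphabet
    return depth == 0
```

&lt;p&gt;A finite automaton has no place to keep &lt;code&gt;depth&lt;&#x2F;code&gt; for arbitrarily deep nesting, which is why no classic regular expression recognizes the second language.&lt;&#x2F;p&gt;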
&lt;p&gt;Recursively Enumerable languages are weird and are outside the topic of this coffee-fueled rant of a blog.
But just to pique your interest, this technically is a valid Type-0 language:&lt;&#x2F;p&gt;
&lt;p&gt;$$
HALT = \lbrace (P, x) | P \texttt{ halts on input } x \rbrace
$$&lt;&#x2F;p&gt;
&lt;p&gt;Which brings us to our second point.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Turing Completeness&lt;&#x2F;strong&gt;: A computational system that can compute every &lt;em&gt;Turing-computable&lt;&#x2F;em&gt; function is called Turing-complete (or Turing-powerful). Alternatively, such a system is one that can simulate a universal Turing machine. (&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;web.archive.org&#x2F;web&#x2F;20260228223759&#x2F;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Turing_completeness&quot;&gt;Wikipedia&lt;&#x2F;a&gt;)&lt;&#x2F;p&gt;
&lt;p&gt;A Turing-computable function is, in layman’s terms, a function that some program can compute: given an input, the program
either returns the value in finite time or dies trying (i.e. never halts) on inputs where the function is undefined.&lt;&#x2F;p&gt;
&lt;div class=&quot;sidebyside sidebyside-left-right&quot;&gt;
    &lt;div class=&quot;sidebyside-first&quot;&gt;
        &lt;p&gt;Turing Computable&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;python&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-storage z-type&quot;&gt;def&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name&quot;&gt; good_function&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-parameter z-function&quot;&gt;inp&lt;&#x2F;span&gt;&lt;span&gt;:&lt;&#x2F;span&gt;&lt;span class=&quot;z-support&quot;&gt; int&lt;&#x2F;span&gt;&lt;span&gt;)&lt;&#x2F;span&gt;&lt;span&gt;:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;	ret&lt;&#x2F;span&gt;&lt;span&gt; :&lt;&#x2F;span&gt;&lt;span class=&quot;z-support&quot;&gt; int&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword&quot;&gt; =&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt; 0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;	for&lt;&#x2F;span&gt;&lt;span&gt; i&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword&quot;&gt; in&lt;&#x2F;span&gt;&lt;span class=&quot;z-support&quot;&gt; range&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span&gt;inp&lt;&#x2F;span&gt;&lt;span&gt;)&lt;&#x2F;span&gt;&lt;span&gt;:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;		ret&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword&quot;&gt; +=&lt;&#x2F;span&gt;&lt;span&gt; i&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;	return&lt;&#x2F;span&gt;&lt;span&gt; ret&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
    &lt;&#x2F;div&gt;
    &lt;div class=&quot;sidebyside-second&quot;&gt;
        &lt;p&gt;&lt;strong&gt;Not&lt;&#x2F;strong&gt; Turing Computable&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;python&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-storage z-type&quot;&gt;def&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name&quot;&gt; busy_beaver&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-parameter z-function&quot;&gt;n&lt;&#x2F;span&gt;&lt;span&gt;:&lt;&#x2F;span&gt;&lt;span class=&quot;z-support&quot;&gt; int&lt;&#x2F;span&gt;&lt;span&gt;)&lt;&#x2F;span&gt;&lt;span&gt;:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-constant&quot;&gt;	...&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-definition z-comment&quot;&gt; #&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt; good luck&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
    &lt;&#x2F;div&gt;
&lt;&#x2F;div&gt;
&lt;p&gt;Why this compsci-101 lecture about programming language &#x2F; language taxonomy?
Because I want you to understand that simple “wants” could cause a big change.&lt;&#x2F;p&gt;
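&lt;p&gt;The thing separating a decidable set of mathematical axioms from an undecidable one
can be as small as the want of multiplication: arithmetic with only addition (Presburger arithmetic) is decidable, but add the common multiplication operation and you get enough power for self-reference, and with it Gödel’s incompleteness.
A trivial operation can have nontrivial consequences.&lt;&#x2F;p&gt;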
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;more-chaos&quot;&gt;More chaos?&lt;&#x2F;h2&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Event&lt;&#x2F;th&gt;&lt;th&gt;Link&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Jeff Atwood calls Gruber “negligent” (2009)&lt;&#x2F;td&gt;&lt;td&gt;https:&#x2F;&#x2F;blog.codinghorror.com&#x2F;the-future-of-markdown&#x2F;&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Standard Markdown → CommonMark rename (2014)&lt;&#x2F;td&gt;&lt;td&gt;https:&#x2F;&#x2F;www.metafilter.com&#x2F;142475&#x2F;Standard-flavored-Markdown&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Ars Technica: Markdown throwdown&lt;&#x2F;td&gt;&lt;td&gt;https:&#x2F;&#x2F;arstechnica.com&#x2F;information-technology&#x2F;2014&#x2F;10&#x2F;markdown-throwdown-what-happens-when-foss-software-gets-corporate-backing&#x2F;&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;HN: “worst small program” quote&lt;&#x2F;td&gt;&lt;td&gt;https:&#x2F;&#x2F;news.ycombinator.com&#x2F;item?id=4700160&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Karl Voit: “Markdown is a Disaster” (2025)&lt;&#x2F;td&gt;&lt;td&gt;https:&#x2F;&#x2F;www.osnews.com&#x2F;story&#x2F;143128&#x2F;markdown-is-a-disaster-why-and-what-to-do-instead&#x2F;&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;If you liked my banter here, you might also like these posts!
I hadn’t read them before writing this. My history with Markdown is as self-contained as it gets.
I have used it in forums and predominantly in Obsidian. My thoughts here are my own. I don’t
support anybody but I do think some people are more right than others.&lt;&#x2F;p&gt;
</content>
        
    </entry>
    <entry xml:lang="en">
        <title>The Evolution of x86 SIMD: From SSE to AVX-512</title>
        <published>2026-01-16T00:00:00+00:00</published>
        <updated>2026-01-16T00:00:00+00:00</updated>
        
        <author>
          <name>
            
              Unknown
            
          </name>
        </author>
        
        <link rel="alternate" type="text/html" href="https://bgslabs.org/blog/evolution-of-x86-simd/"/>
        <id>https://bgslabs.org/blog/evolution-of-x86-simd/</id>
        
        <content type="html" xml:base="https://bgslabs.org/blog/evolution-of-x86-simd/">&lt;blockquote&gt;
&lt;p&gt;The story of x86 SIMD is not &lt;em&gt;simply&lt;&#x2F;em&gt; about technology.
It’s about marketing, &lt;ins&gt;corporate politics&lt;&#x2F;ins&gt;, engineering
compromises, and competitive pressure. This is the behind-the-scenes
history of how Intel and AMD battled for vector supremacy, the
controversial decisions that defined an architecture, and the
&lt;strong&gt;personalities&lt;&#x2F;strong&gt; who made it happen.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;part-i-the-not-so-humble-beginnings-1993-1999&quot;&gt;Part I: The Not-So Humble Beginnings (1993-1999)&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;the-mmx-gamble-intel-s-israel-team-takes-a-huge-risk&quot;&gt;The MMX Gamble: Intel’s Israel Team Takes a Huge Risk&lt;&#x2F;h3&gt;
&lt;p&gt;The story of MMX begins not in &lt;span class=&quot;map-inline&quot;&gt;
    &lt;button
        class=&quot;map-trigger&quot;
        type=&quot;button&quot;
        data-lat=&quot;37.3541&quot;
        data-lon=&quot;-121.9552&quot;
        data-zoom=&quot;12&quot;
        data-title=&quot;Santa Clara, CA&quot;
    &gt;
        Santa Clara
    &lt;&#x2F;button&gt;

    &lt;span class=&quot;map-popover&quot;&gt;
        &lt;span class=&quot;map-header&quot;&gt;Santa Clara, CA&lt;&#x2F;span&gt;
        &lt;span class=&quot;map&quot;&gt;&lt;&#x2F;span&gt;
    &lt;&#x2F;span&gt;
&lt;&#x2F;span&gt;
, but in &lt;strong&gt;Haifa, Israel&lt;&#x2F;strong&gt;.
In 1993, Intel made an unprecedented decision: they would let their
Israel Development Center design and build a mainstream microprocessor,
the Pentium MMX, &lt;strong&gt;the first time Intel developed a flagship processor
outside the United States&lt;&#x2F;strong&gt;. &lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#1&quot;&gt;1&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
&lt;p&gt;This was a massive gamble. According to Intel’s own technology journal,
the development of MMX technology &lt;strong&gt;spanned five years&lt;&#x2F;strong&gt; and involved
&lt;strong&gt;over 300 engineers&lt;&#x2F;strong&gt; across four Intel sites. At the center of this
effort was &lt;strong&gt;Uri Weiser&lt;&#x2F;strong&gt;, director of the Architecture group at the
IDC in Haifa.&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#1&quot;&gt;1&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt; &lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#2&quot;&gt;2&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Uri Weiser later recalled the struggle&lt;&#x2F;strong&gt; with characteristic
understatement: “Some people were ready to quit.”
He was named an &lt;strong&gt;Intel Fellow&lt;&#x2F;strong&gt; for his work on MMX architecture,
a rare honor that speaks to the significance of what the Israel team
accomplished.&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#1&quot;&gt;1&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Meanwhile, in &lt;span class=&quot;map-inline&quot;&gt;
    &lt;button
        class=&quot;map-trigger&quot;
        type=&quot;button&quot;
        data-lat=&quot;32.7940&quot;
        data-lon=&quot;34.9896&quot;
        data-zoom=&quot;12&quot;
        data-title=&quot;Haifa, Israel&quot;
    &gt;
        Haifa
    &lt;&#x2F;button&gt;

    &lt;span class=&quot;map-popover&quot;&gt;
        &lt;span class=&quot;map-header&quot;&gt;Haifa, Israel&lt;&#x2F;span&gt;
        &lt;span class=&quot;map&quot;&gt;&lt;&#x2F;span&gt;
    &lt;&#x2F;span&gt;
&lt;&#x2F;span&gt;
,
300 engineers were about to make a decision that
would haunt x86 for the next three decades.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;the-technical-reason-for-the-controversial-register-decision&quot;&gt;The Technical Reason for the Controversial Register Decision&lt;&#x2F;h3&gt;
&lt;p&gt;Here is where things get spicy. The most consequential and
controversial decision in MMX design was &lt;strong&gt;register aliasing&lt;&#x2F;strong&gt;. Intel
aliased the 8 new MMX registers (MM0-MM7) directly onto the existing
x87 floating-point register stack (ST(0)-ST(7)).&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#5&quot;&gt;3&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Why they did this&lt;&#x2F;strong&gt;: To avoid adding new processor state. At the time,
operating systems only knew how to save&#x2F;restore the x87 FPU registers
during context switches. Adding 8 entirely new registers would have
required OS modifications across Windows, Linux, and every other x86
OS.&lt;&#x2F;p&gt;
&lt;p&gt;This was the 1990s, remember: convincing Microsoft to change
Windows was roughly as easy as convincing your cat to enjoy water
sports.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;The cost&lt;&#x2F;strong&gt;: You &lt;strong&gt;cannot mix floating-point and MMX instructions&lt;&#x2F;strong&gt; in
the same routine without risking register corruption. Programmers must
use the &lt;code&gt;EMMS&lt;&#x2F;code&gt; (Empty MMX State) instruction to switch between modes,
and even then, there’s overhead.&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#6&quot;&gt;4&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt; Think of it like sharing a closet
with your neighbor: sure, it saves space, but good luck finding your
socks when they’ve mysteriously migrated to the other person’s side.&lt;&#x2F;p&gt;
&lt;p&gt;The register state mapping can be expressed as:&lt;&#x2F;p&gt;
&lt;p&gt;$$
\forall i \in \lbrace 0,\dots,7 \rbrace: \text{MM}_i \equiv \text{ST}(i)
$$&lt;&#x2F;p&gt;
&lt;p&gt;where $\equiv$ denotes hardware-level aliasing (same physical storage).&lt;&#x2F;p&gt;
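&lt;p&gt;A toy model makes the cost of that aliasing easier to see. The class and method names below are hypothetical, and the model is deliberately crude: it ignores the x87 tag word, the top-of-stack pointer that &lt;code&gt;ST(i)&lt;&#x2F;code&gt; is relative to, and &lt;code&gt;EMMS&lt;&#x2F;code&gt;. It only shows the one thing that matters here, two sets of register names backed by the same storage.&lt;&#x2F;p&gt;

```python
class ToyMmxX87State:
    """Toy model: one bank of eight slots seen through two sets of names."""

    def __init__(self):
        self.slots = [0.0] * 8  # the shared physical storage

    def write_st(self, i, value):
        self.slots[i] = float(value)  # x87 view: a floating-point value

    def read_st(self, i):
        return self.slots[i]

    def write_mm(self, i, packed):
        self.slots[i] = int(packed)  # MMX view: 64 bits of packed integers

    def read_mm(self, i):
        return self.slots[i]


state = ToyMmxX87State()
state.write_st(0, 3.14)                 # floating-point code parks a value in ST(0)
state.write_mm(0, 0x0001000200030004)   # MMX code then uses MM0
clobbered = state.read_st(0)            # the float is gone; the slot holds MMX data
```

&lt;p&gt;Because no new storage exists, the OS saves and restores it for free with the rest of the x87 state, and for the same reason a routine cannot keep live floats and live MMX values at once.&lt;&#x2F;p&gt;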
&lt;p&gt;Intel’s engineers knew this was a compromise. But
they made a calculated bet: most multimedia applications separate data
generation (FP) from display (SIMD), so the restriction would rarely
matter in practice.&lt;&#x2F;p&gt;
&lt;p&gt;They were mostly right. Mostly…&lt;&#x2F;p&gt;
&lt;h3 id=&quot;the-mmx-naming-controversy&quot;&gt;The “MMX” Naming Controversy&lt;&#x2F;h3&gt;
&lt;p&gt;Intel pulled a masterstroke with the MMX name. Officially, MMX is a
&lt;strong&gt;meaningless initialism&lt;&#x2F;strong&gt;, not an acronym at all. Intel trademarked the
letters “MMX” specifically to prevent competitors from using them.&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#7&quot;&gt;5&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;The internal debate&lt;&#x2F;strong&gt;: Unofficially, the name was derived from either:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;MultiMedia eXtension&lt;&#x2F;strong&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Matrix Math eXtension&lt;&#x2F;strong&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Intel has never officially confirmed which, because apparently they
wanted to preserve the mystique. Or maybe they forgot. Hard to say.&lt;&#x2F;p&gt;
&lt;p&gt;When AMD produced marketing material suggesting MMX stood for “Matrix
Math Extensions” (based on internal Intel documents), Intel &lt;strong&gt;sued AMD&lt;&#x2F;strong&gt;
in 1997 with the enthusiasm of a copyright troll at a convention,
claiming trademark infringement. AMD argued that “MMX” was a generic
term for multimedia extensions.&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#8&quot;&gt;6&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;The settlement&lt;&#x2F;strong&gt;: AMD eventually acknowledged MMX as Intel’s trademark
and received rights to use the name on their chips. But Intel’s
aggressive legal stance sent a message: this was their playground, and
competitors would have to find their own identity. (Looking at you,
3DNow!)&lt;&#x2F;p&gt;
&lt;h3 id=&quot;the-marketing-hype-backlash&quot;&gt;The Marketing Hype Backlash&lt;&#x2F;h3&gt;
&lt;p&gt;Intel launched MMX with a &lt;strong&gt;Super Bowl commercial&lt;&#x2F;strong&gt; featuring Jason
Alexander, promising revolutionary multimedia capabilities. The hype was
enormous.&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#9&quot;&gt;7&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt; This was 1997, when Super Bowl commercials were still an
event and people actually watched them for the ads.&lt;&#x2F;p&gt;
&lt;p&gt;When the Pentium MMX shipped, reviewers found that
for non-optimized applications, the real-world performance gain was
only &lt;strong&gt;10-20%&lt;&#x2F;strong&gt;, mostly from the doubled L1 cache (32KB vs 16KB).&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#10&quot;&gt;8&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
&lt;p&gt;One technology journalist called MMX “90%
marketing and 10% technical innovation.” PC Magazine Labs found only
modest gains for existing Windows 95 applications.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Intel’s defense&lt;&#x2F;strong&gt;: They claimed 50-700% improvements for MMX-optimized
software, but the catch was obvious: &lt;strong&gt;almost no software was optimized
at launch&lt;&#x2F;strong&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Now, to put this into perspective, a textbook example of where MMX would
help is a function like this:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;c&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-storage z-type&quot;&gt;void&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-entity z-name&quot;&gt;add_i32&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span class=&quot;z-storage z-type&quot;&gt;int&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword&quot;&gt;*&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable&quot;&gt; dest&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span class=&quot;z-storage&quot;&gt; const&lt;&#x2F;span&gt;&lt;span class=&quot;z-storage z-type&quot;&gt; int&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword&quot;&gt;*&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable&quot;&gt; a&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span class=&quot;z-storage&quot;&gt; const&lt;&#x2F;span&gt;&lt;span class=&quot;z-storage z-type&quot;&gt; int&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword&quot;&gt;*&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable&quot;&gt; b&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span class=&quot;z-storage z-type&quot;&gt; int&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable&quot;&gt; n&lt;&#x2F;span&gt;&lt;span&gt;)&lt;&#x2F;span&gt;&lt;span&gt;;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-storage z-type&quot;&gt;void&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-entity z-name&quot;&gt;add_i32&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span class=&quot;z-storage z-type&quot;&gt;int&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword&quot;&gt;*&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable&quot;&gt; dest&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span class=&quot;z-storage&quot;&gt; const&lt;&#x2F;span&gt;&lt;span class=&quot;z-storage z-type&quot;&gt; int&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword&quot;&gt;*&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable&quot;&gt; a&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span class=&quot;z-storage&quot;&gt; const&lt;&#x2F;span&gt;&lt;span class=&quot;z-storage z-type&quot;&gt; int&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword&quot;&gt;*&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable&quot;&gt; b&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span class=&quot;z-storage z-type&quot;&gt; int&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable&quot;&gt; n&lt;&#x2F;span&gt;&lt;span&gt;)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;{&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-storage z-type&quot;&gt; int&lt;&#x2F;span&gt;&lt;span&gt; i&lt;&#x2F;span&gt;&lt;span&gt;;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt; for&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span&gt;i&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword&quot;&gt;=&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;;&lt;&#x2F;span&gt;&lt;span&gt; i&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span&gt;n&lt;&#x2F;span&gt;&lt;span&gt;;&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword&quot;&gt;++&lt;&#x2F;span&gt;&lt;span&gt;i&lt;&#x2F;span&gt;&lt;span&gt;)&lt;&#x2F;span&gt;&lt;span&gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-variable&quot;&gt;  dest&lt;&#x2F;span&gt;&lt;span&gt;[&lt;&#x2F;span&gt;&lt;span&gt;i&lt;&#x2F;span&gt;&lt;span&gt;]&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword&quot;&gt; =&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable&quot;&gt; a&lt;&#x2F;span&gt;&lt;span&gt;[&lt;&#x2F;span&gt;&lt;span&gt;i&lt;&#x2F;span&gt;&lt;span&gt;]&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword&quot;&gt; +&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable&quot;&gt; b&lt;&#x2F;span&gt;&lt;span&gt;[&lt;&#x2F;span&gt;&lt;span&gt;i&lt;&#x2F;span&gt;&lt;span&gt;]&lt;&#x2F;span&gt;&lt;span&gt;;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Which in turn should produce beautiful, MMX-register-using delicacies like:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;asm&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;movq&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt; mm0&lt;&#x2F;span&gt;&lt;span&gt;, [a+i]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;movq&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt; mm1&lt;&#x2F;span&gt;&lt;span&gt;, [b+i]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;paddd&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt; mm0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;mm1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;movq&lt;&#x2F;span&gt;&lt;span&gt; [dst+i], &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;mm0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;(or even better just unroll the loop, but for the sake of argument I’m omitting that)&lt;&#x2F;p&gt;
&lt;p&gt;But in reality, GCC 2.7.2.3 produced this thing:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;asm&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-comment&quot;&gt;; cc -O2 -S test.c&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;movl (%&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;ebx&lt;&#x2F;span&gt;&lt;span&gt;,%&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;eax&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;4&lt;&#x2F;span&gt;&lt;span&gt;), %&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;edi&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt;   ; load a[i]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;addl (%&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;ecx&lt;&#x2F;span&gt;&lt;span&gt;,%&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;eax&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;4&lt;&#x2F;span&gt;&lt;span&gt;), %&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;edi&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt;   ; add b[i]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;movl %&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;edi&lt;&#x2F;span&gt;&lt;span&gt;, (%&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;esi&lt;&#x2F;span&gt;&lt;span&gt;,%&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;eax&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;4&lt;&#x2F;span&gt;&lt;span&gt;)   &lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt;; store dst[i]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;incl %&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;eax&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;cmpl %&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;edx&lt;&#x2F;span&gt;&lt;span&gt;, %&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;eax&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Which is like comparing a bicycle to a car. Yes, it will be correct, but it’s simply
too slow.&lt;&#x2F;p&gt;
&lt;p&gt;There is no “polite” C code in 1997 that nudges GCC 2.7.x into MMX.
You can write &lt;code&gt;restrict&lt;&#x2F;code&gt; but it will not work.&lt;&#x2F;p&gt;
&lt;p&gt;You either write &lt;strong&gt;MMX&lt;&#x2F;strong&gt; &lt;em&gt;explicitly&lt;&#x2F;em&gt;, or you don’t get &lt;strong&gt;MMX at all&lt;&#x2F;strong&gt;.&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;See Appendix B &lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#code1&quot;&gt;9&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt; for comprehensive line-by-line analysis of this assembly output, including verification of GCC 2.7.2.3 authenticity and explanation of why MMX instructions were not generated.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
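&lt;p&gt;For anyone wondering what the &lt;code&gt;paddd&lt;&#x2F;code&gt; in the hand-written version actually buys you, here is a rough Python sketch of its semantics: one 64-bit register treated as two independent 32-bit lanes, each added with wraparound. The helper names &lt;code&gt;pack&lt;&#x2F;code&gt; and &lt;code&gt;LANE&lt;&#x2F;code&gt; are mine, and this models only the arithmetic, not the speed.&lt;&#x2F;p&gt;

```python
LANE = 2 ** 32  # number of values one 32-bit lane can hold


def pack(lo, hi):
    """Pack two unsigned 32-bit integers into one 64-bit value."""
    return (hi % LANE) * LANE + (lo % LANE)


def paddd(x, y):
    """Add two 64-bit values as two independent 32-bit lanes, wrapping on overflow."""
    x_hi, x_lo = divmod(x, LANE)
    y_hi, y_lo = divmod(y, LANE)
    lo = (x_lo + y_lo) % LANE  # overflow wraps; no carry leaks into the upper lane
    hi = (x_hi + y_hi) % LANE
    return hi * LANE + lo


two_sums = paddd(pack(1, 2), pack(10, 20))    # lanes hold 11 and 22
wrapped = paddd(pack(LANE - 1, 0), pack(1, 0))  # low lane wraps to 0, high lane stays 0
```

&lt;p&gt;So each trip through the MMX loop handles two &lt;code&gt;int&lt;&#x2F;code&gt; elements where the scalar loop handles one, and a plain 64-bit add would not do because the carry from the low element would corrupt the high one.&lt;&#x2F;p&gt;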
&lt;h3 id=&quot;sse-intel-s-response-to-amd-s-3dnow&quot;&gt;SSE: Intel’s Response to AMD’s 3DNow!&lt;&#x2F;h3&gt;
&lt;p&gt;While MMX was still proving itself, Intel’s product definition team made
a bold proposal: add SIMD floating-point capabilities to the next
processor, code-named “Katmai.”&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#11&quot;&gt;10&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;The internal debate&lt;&#x2F;strong&gt;: Intel executives were hesitant. MMX hadn’t even
shipped yet. Were they betting too heavily on SIMD? Was this just
another marketing gimmick?&lt;&#x2F;p&gt;
&lt;p&gt;According to Intel’s own account, the meeting was “inconclusive.”
Executives demanded more questions be answered. &lt;strong&gt;Two weeks later&lt;&#x2F;strong&gt;, they
gave the OK for Katmai (later named Pentium III).&lt;&#x2F;p&gt;
&lt;p&gt;Meanwhile, in &lt;span class=&quot;map-inline&quot;&gt;
    &lt;button
        class=&quot;map-trigger&quot;
        type=&quot;button&quot;
        data-lat=&quot;37.3688&quot;
        data-lon=&quot;-122.0363&quot;
        data-zoom=&quot;12&quot;
        data-title=&quot;Sunnyvale, CA&quot;
    &gt;
        Sunnyvale
    &lt;&#x2F;button&gt;

    &lt;span class=&quot;map-popover&quot;&gt;
        &lt;span class=&quot;map-header&quot;&gt;Sunnyvale, CA&lt;&#x2F;span&gt;
        &lt;span class=&quot;map&quot;&gt;&lt;&#x2F;span&gt;
    &lt;&#x2F;span&gt;
&lt;&#x2F;span&gt;, AMD was watching. And plotting.&lt;&#x2F;p&gt;
&lt;p&gt;AMD’s &lt;strong&gt;3DNow!&lt;&#x2F;strong&gt;, introduced in the K6-2 in May 1998, was a direct
response to MMX’s biggest weakness: &lt;strong&gt;no floating-point SIMD&lt;&#x2F;strong&gt;. AMD
added 21 instructions that could handle single-precision floating-point
operations in parallel.&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#12&quot;&gt;11&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Suddenly, Intel’s fancy new multimedia
extension couldn’t actually do the floating-point math that 3D graphics
required. Oops :p&lt;&#x2F;p&gt;
&lt;p&gt;When Pentium III (Katmai) shipped in February 1999, it introduced &lt;strong&gt;SSE
(Streaming SIMD Extensions)&lt;&#x2F;strong&gt; with 70 new instructions and &lt;strong&gt;8 entirely
new 128-bit registers&lt;&#x2F;strong&gt; (XMM0-XMM7).&lt;&#x2F;p&gt;
&lt;p&gt;The new registers meant extra processor state to save and restore
on every context switch, which required OS modifications (looking at you again,
&lt;em&gt;Microsoft&lt;&#x2F;em&gt;). Even so, Intel implemented the 128-bit floating-point
units in a “hack” way: a 4-wide SSE instruction gets broken into &lt;strong&gt;two 64-bit
micro-operations&lt;&#x2F;strong&gt;, executed on two separate units.&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#13&quot;&gt;12&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Intel “sorta” succeeded in adding 128-bit SIMD FP. The implementation
was clever, efficient, and space-conscious, but it was a hack that would haunt
optimization efforts for years.&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#14&quot;&gt;13&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt; The word “sorta” appears in
technical documentation approximately never, which tells you something
about just how much of a hack this was!&lt;&#x2F;p&gt;
&lt;p&gt;It might be worth noting that this persisted for a long time (through Pentium M).
Intel didn’t get true &lt;em&gt;single-cycle&lt;&#x2F;em&gt; 128-bit width until Core 2 (Conroe).
AMD’s K8 also split 128-bit operations into 64-bit halves; AMD only got true 128-bit width in
hardware execution units with the K10.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;part-ii-the-sse-wars-2000-2008&quot;&gt;Part II: The SSE Wars (2000-2008)&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;sse2-2000-the-pentium-4-s-and-sledgehammers&quot;&gt;SSE2 (2000): The Pentium 4’s and Sledgehammers&lt;&#x2F;h3&gt;
&lt;p&gt;Intel’s SSE2 wasn’t driven by a new application breakthrough. It was a
&lt;strong&gt;defensive move against AMD’s 3DNow! and the looming threat of K7
(Athlon)&lt;&#x2F;strong&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Intel was under immense pressure. AMD’s K6-2 had
demonstrated that SIMD instructions mattered for gaming
and 3D graphics, and Intel needed to respond fast.&lt;&#x2F;p&gt;
&lt;p&gt;The key driver was &lt;strong&gt;real-time 3D gaming and DirectX performance&lt;&#x2F;strong&gt;.
Microsoft had been pushing Intel for better SIMD support since
DirectX 7.&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#16&quot;&gt;14&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
&lt;p&gt;SSE2 introduced 144 new instructions including double-precision FP:&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#15&quot;&gt;15&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;asm&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-comment&quot;&gt;; SSE2 double-precision operations (64-bit lanes)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;movapd&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;   xmm0&lt;&#x2F;span&gt;&lt;span&gt;, [&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;rax&lt;&#x2F;span&gt;&lt;span&gt;] &lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt;; Load aligned packed doubles&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;addpd&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;    xmm0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;xmm1&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt;  ; Add: xmm0[63:0]   += xmm1[63:0]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-comment&quot;&gt;                     ;      xmm0[127:64] += xmm1[127:64]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;mulpd&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;    xmm0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;xmm2&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt;  ; Multiply packed doubles&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;sqrtpd&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;   xmm0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;xmm3&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt;  ; Square root per 64-bit lane&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;unpckhpd&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt; xmm0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;xmm1&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt;  ; try saying that 5 times :D&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-comment&quot;&gt;                     ; Unpack high doubles&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;cvtpd2ps&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt; xmm0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;xmm1&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt;  ; Convert packed doubles to singles&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;In 2003, AMD’s K8 (“Sledgehammer”) arrived with an integrated memory controller,
HyperTransport, and a little instruction set difference called
AMD64 (aka x86-64). Because of this Intel had to rethink everything. Not only SIMD but
now their whole instruction set had to be renovated, and Intel’s answer only came a
year later, in 2004.&lt;&#x2F;p&gt;
&lt;p&gt;Intel had invested heavily in IA-64 (Itanium) for 64-bit,
but Itanium could not run x86 code at native speed and had to emulate it.
AMD64 offered a 64-bit path without breaking existing software,
giving AMD a massive practical advantage.&lt;&#x2F;p&gt;
&lt;p&gt;32-bit x86 without hacks capped out at 4GB of virtual memory per process,
had only 8 general-purpose registers, and relied on x87 floating point with &lt;strong&gt;optional&lt;&#x2F;strong&gt; SSE.&lt;&#x2F;p&gt;
&lt;p&gt;AMD64 built on top of that with &lt;strong&gt;&lt;em&gt;long mode&lt;&#x2F;em&gt;&lt;&#x2F;strong&gt;, while existing 32-bit code
kept running unchanged in &lt;em&gt;compatibility&lt;&#x2F;em&gt; mode. Now you could have more than 4GB of memory per process,
more general-purpose registers (16 in total), and much more, without sacrificing
compatibility.&lt;&#x2F;p&gt;
&lt;p&gt;AMD64 in long mode also guaranteed SSE2, which made compilers’ jobs easier: 64-bit code
can use SSE2 without a runtime feature check.&lt;&#x2F;p&gt;
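&lt;p&gt;To make the “optional SSE” point concrete, here is a rough sketch of the check
32-bit code has to run before it may touch SSE2. It assumes NASM-style syntax and a CPU
that supports the CPUID instruction; the labels use_sse2 and use_x87 are made up for
illustration and stand for whatever code paths a program provides.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;asm&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-comment&quot;&gt;; 32-bit x86: SSE2 is optional, so ask CPUID before using it&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;mov&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;    eax&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;1&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt;           ; CPUID leaf 1 returns the feature flags&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;cpuid&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt;                   ; clobbers eax, ebx, ecx, edx&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;test&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;   edx&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;0x04000000&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt;  ; EDX bit 26 = SSE2&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;jnz&lt;&#x2F;span&gt;&lt;span&gt;    use_sse2&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt;         ; hypothetical label: SSE2 code path&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;jmp&lt;&#x2F;span&gt;&lt;span&gt;    use_x87&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt;          ; hypothetical label: x87 fallback&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Code compiled for long mode can skip this dance entirely, which is exactly the
simplification compilers got for free.&lt;&#x2F;p&gt;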
&lt;h3 id=&quot;sse3-2004-prescott-s-reckoning&quot;&gt;SSE3 (2004): Prescott’s Reckoning&lt;&#x2F;h3&gt;
&lt;p&gt;SSE3’s official driver was “media encoding improvements,” but the
&lt;strong&gt;real story is far more troubled&lt;&#x2F;strong&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;SSE3 (aka PNI, Prescott New Instructions) was introduced with Prescott,
the 90nm Pentium 4 revision that would become Intel’s biggest nightmare. The 13 new
instructions were heavily trimmed due to power concerns.&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#17&quot;&gt;16&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
&lt;p&gt;The new instructions could be used to accelerate 3D workflows and video codecs.
As usual, Intel released the hardware first and waited for software to catch up later,
with one exception: the Intel C++ 8.0 compiler, which already supported SSE3 instructions.&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#19&quot;&gt;17&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;Although Intel released the SSE3 instructions guidelines for
software developers last summer, there are no programs yet,…&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;… according to Intel, &lt;code&gt;LDDQU&lt;&#x2F;code&gt; instruction could speed up video
compression by 10% if used in data encoding algorithms…&lt;&#x2F;p&gt;
&lt;p&gt;Ilya Gavrichenkov (xbitlabs.com, 02.01.2004) &lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#19&quot;&gt;17&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;Horizontal operations (combining adjacent elements within a single register,
rather than matching lanes of two registers) were a new concept:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;asm&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-comment&quot;&gt;; SSE3 horizontal operations&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;haddpd&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt; xmm0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;xmm1&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt;   ; Horizontal add packed doubles&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-comment&quot;&gt;                    ; Before: xmm0 = {a0, a1}, xmm1 = {b0, b1}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-comment&quot;&gt;                    ; After:  xmm0 = {a0+a1, b0+b1}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;hsubpd&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt; xmm0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;xmm1&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt;   ; Horizontal subtract packed doubles&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-comment&quot;&gt;                    ; After: xmm0 = {a0-a1, b0-b1}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;movddup&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt; xmm0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;xmm1&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt;  ; Move and duplicate low double&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-comment&quot;&gt;                    ; xmm1 = {a, b} -&amp;gt; xmm0 = {a, a}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;movshdup&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt; xmm0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;xmm1&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt; ; Duplicate odd-index singles&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-comment&quot;&gt;                    ; xmm1 = {a0, a1, a2, a3} -&amp;gt; xmm0 = {a1, a1, a3, a3}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#†&quot;&gt;18&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
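&lt;p&gt;As a small illustration of why horizontal adds were welcome, here is a sketch that
sums the four singles held in one register. It uses HADDPS, the single-precision sibling
of HADDPD from the same SSE3 set; the choice of xmm0 is arbitrary.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;asm&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-comment&quot;&gt;; Assumed input: xmm0 = {a0, a1, a2, a3}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;haddps&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt; xmm0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;xmm0&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt;   ; xmm0 = {a0+a1, a2+a3, a0+a1, a2+a3}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;haddps&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt; xmm0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;xmm0&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt;   ; every lane now holds a0+a1+a2+a3&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Before SSE3, the same reduction took a chain of shuffles and vertical adds.&lt;&#x2F;p&gt;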
&lt;p&gt;Intel executives had acknowledged the growing challenges with clock
speed scaling as the industry hit what some called a “power wall.”&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#18&quot;&gt;19&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;
Prescott’s 31-stage pipeline generated so much heat that Intel had to
&lt;strong&gt;cut SSE3 instruction complexity&lt;&#x2F;strong&gt; to reduce power draw. The thermal
challenges were significant enough that power efficiency became a
primary concern in processor design.&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#19&quot;&gt;17&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
&lt;h3 id=&quot;ssse3-2006-the-core-2-rebirth&quot;&gt;SSSE3 (2006): The Core 2 Rebirth&lt;&#x2F;h3&gt;
&lt;p&gt;SSSE3 (Supplemental Streaming SIMD Extensions 3) wasn’t planned as a
separate extension. It was a set of &lt;strong&gt;emergency additions to fix Core architecture’s
weaknesses&lt;&#x2F;strong&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;When Intel abandoned NetBurst for Core (Conroe&#x2F;Merom), they discovered
their new architecture lacked certain acceleration paths. The 16 new
instructions in SSSE3 (including &lt;strong&gt;PMULHRSW, PABSB&#x2F;PABSW&#x2F;PABSD, and
PALIGNR&lt;&#x2F;strong&gt;) were specifically designed to address common performance
bottlenecks. &lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#20&quot;&gt;20&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
&lt;p&gt;SSSE3 was introduced with the Intel Xeon processor 5100
series and Intel Core 2 processor family. Intel counts 32 SSSE3 instructions
for SIMD integer data: the 16 operations, each in a 64-bit MMX and a 128-bit XMM form.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;asm&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-comment&quot;&gt;; SSSE3 new instructions&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;pabsb&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;    xmm0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;xmm1&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt;       ; Packed absolute value (byte)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;pabsw&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;    xmm0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;xmm1&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt;       ; Packed absolute value (word)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;pabsd&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;    xmm0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;xmm1&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt;       ; Packed absolute value (dword)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;phaddw&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;   xmm0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;xmm1&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt;       ; Packed horizontal add (word)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;phaddd&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;   xmm0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;xmm1&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt;       ; Packed horizontal add (dword)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;phsubw&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;   xmm0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;xmm1&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt;       ; Packed horizontal sub (word)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;phsubd&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;   xmm0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;xmm1&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt;       ; Packed horizontal sub (dword)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;pmulhrsw&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt; xmm0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;xmm1&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt;       ; Packed multiply high (rounded and scaled)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;palignr&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;  xmm0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;xmm1&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;3&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt;    ; Packed align right&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;One could think that these are not real ALU instructions
but the result of a cat walking over an Intel engineer’s
keyboard and pressing random buttons. If you thought that,
you’d be right about one of those assumptions.&lt;&#x2F;p&gt;
&lt;p&gt;Those are not &lt;em&gt;purely&lt;&#x2F;em&gt; &lt;strong&gt;ALU&lt;&#x2F;strong&gt; instructions.&lt;&#x2F;p&gt;
&lt;p&gt;These instructions were added &lt;strong&gt;without changing the Core microarchitecture&lt;&#x2F;strong&gt;.
They were largely microcode&#x2F;decoder based additions. These instructions
did not introduce new arithmetic capabilities or execution units;
they collapsed common multi-instruction SIMD idioms into single operations
that mapped onto existing ALUs and shuffle units.&lt;&#x2F;p&gt;
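&lt;p&gt;To make “collapsing idioms” concrete, here is one common SSE2 way to take the
absolute value of eight signed words, next to its SSSE3 replacement. The SSE2 sequence is
just one well-known idiom, not necessarily the one Intel had in mind, and xmm1 is
assumed to be free as a scratch register.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;asm&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-comment&quot;&gt;; SSE2: |x| for eight signed words in xmm0, xmm1 is scratch&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;movdqa&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;   xmm1&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;xmm0&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt;       ; Copy the input&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;psraw&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;    xmm1&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;15&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt;         ; xmm1 = 0xFFFF in negative lanes, 0 elsewhere&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;pxor&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;     xmm0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;xmm1&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt;       ; Invert the bits of negative lanes&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;psubw&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;    xmm0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;xmm1&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt;       ; Subtract -1 there, completing the negation&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-comment&quot;&gt;; SSSE3: the same result in one instruction, no scratch register&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;pabsw&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;    xmm0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;xmm0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Nothing here is new arithmetic: a shift, an XOR and a subtract were always available.
The win is fewer instructions to decode and one less register to burn.&lt;&#x2F;p&gt;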
&lt;h3 id=&quot;sse4-2007&quot;&gt;SSE4 (2007)&lt;&#x2F;h3&gt;
&lt;p&gt;SSE4 was split into two parts,
&lt;strong&gt;SSE4.1 (video&#x2F;graphics)&lt;&#x2F;strong&gt; and &lt;strong&gt;SSE4.2 (database&#x2F;text)&lt;&#x2F;strong&gt;. This was
deliberate: Intel didn’t want video acceleration to wait until the database
features were ready to ship.&lt;&#x2F;p&gt;
&lt;p&gt;The &lt;strong&gt;H.264 video encoding explosion&lt;&#x2F;strong&gt; drove SSE4.1. By 2006, YouTube
was growing and everyday video creation and consumption were consuming
massive CPU resources, and Intel needed hardware acceleration.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;14 new video-oriented instructions&lt;&#x2F;strong&gt; were specifically designed for
H.264 encoding: &lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#22&quot;&gt;21&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt; &lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#23&quot;&gt;22&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;MPSADBW&lt;&#x2F;strong&gt; - Multiple packed sums of absolute differences (eight 4-byte SADs at once, for motion estimation)&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;PHMINPOSUW&lt;&#x2F;strong&gt; - Horizontal minimum and its position (used in motion vector selection)&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;DPPS&#x2F;DPPD&lt;&#x2F;strong&gt; - Dot product (floating-point, for video filtering)
…&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;asm&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-comment&quot;&gt;; SSE4.1 video encoding instructions&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;mpsadbw&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt; xmm0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;xmm1&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;0&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt; ; Multi-sum absolute differences&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-comment&quot;&gt;                      ; Computes 8 SAD operations between blocks&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;phminposuw&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt; xmm0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;xmm1&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt; ; Horizontal min pos (unsigned word)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-comment&quot;&gt;                      ; Finds minimum value and its position in the packed words&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;dpps&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt; xmm0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;xmm1&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;0xFF&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt; ; Dot product of packed singles&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-comment&quot;&gt;                      ; imm8 0xFF: every lane of xmm0 = sum(xmm0[i] * xmm1[i]) for i=0..3&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;pmaxsb&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt; xmm0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;xmm1&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt;     ; Packed maximum (signed byte)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;pminuw&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt; xmm0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;xmm1&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt;     ; Packed minimum (unsigned word)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;pextrb&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt; eax&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;xmm1&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;5&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt;   ; Extract byte 5 of xmm1 into eax (zero-extended)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;pinsrd&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt; xmm0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;eax&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;2&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt;   ; Insert eax as dword 2 of xmm0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;In theory, the new instructions significantly accelerated motion estimation workloads.&lt;&#x2F;p&gt;
&lt;p&gt;Penryn showed significant improvements in video encoding over earlier Core 2 chips at the
same clock speeds. Intel’s Fall 2007 IDF demo showed x264 encoding
performance improvements that were substantial enough to generate
significant developer interest in optimizing their code.&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#25&quot;&gt;23&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
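&lt;p&gt;The motion estimation use case can be sketched as a short search step. This is a
simplified illustration, not code from any real encoder: it assumes xmm0 holds a row of
candidate pixels and the low four bytes of xmm1 hold the reference pixels, and the
choice of xmm2 and eax is arbitrary.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;asm&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-comment&quot;&gt;; Find the best of eight candidate offsets for a 4-pixel reference&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;mpsadbw&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;    xmm0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;xmm1&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;0&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt;  ; xmm0 = eight 16-bit SADs, one per candidate offset&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;phminposuw&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt; xmm2&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;xmm0&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt;     ; xmm2[15:0] = smallest SAD, xmm2[18:16] = its index&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;movd&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;       eax&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;xmm2&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt;      ; ax = best SAD, bits 16-18 of eax = winning offset&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Three instructions replace what would otherwise be eight separate SAD computations
followed by a compare-and-select loop.&lt;&#x2F;p&gt;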
&lt;h3 id=&quot;sse4-2-2008-nehalem-s-database-revolution&quot;&gt;SSE4.2 (2008): Nehalem’s Database Revolution&lt;&#x2F;h3&gt;
&lt;p&gt;Intel’s focus on data center and enterprise workloads wasn’t born from an
acquisition of an existing database team; it was shaped by &lt;strong&gt;two strategic XML
acquisitions&lt;&#x2F;strong&gt;. In &lt;strong&gt;August 2005&lt;&#x2F;strong&gt;, Intel acquired &lt;strong&gt;Sarvega&lt;&#x2F;strong&gt;, an XML networking
company. In &lt;strong&gt;February 2006&lt;&#x2F;strong&gt;, they followed up by acquiring &lt;strong&gt;Conformative&lt;&#x2F;strong&gt;,
an XML processing startup.&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#26&quot;&gt;24&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
&lt;p&gt;These acquisitions could have brought expertise in text processing and XML acceleration
into Intel’s Software and Solutions Group. The engineering knowledge
from Sarvega and Conformative probably influenced the &lt;strong&gt;STTNI (String and Text
New Instructions)&lt;&#x2F;strong&gt; in SSE4.2, first shipping with Nehalem in 2008.&lt;&#x2F;p&gt;
&lt;p&gt;Four instructions were &lt;strong&gt;specifically designed for database and string
processing&lt;&#x2F;strong&gt;:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;CRC32&lt;&#x2F;strong&gt; - Hardware-accelerated checksums (for storage&#x2F;network)&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;POPCNT&lt;&#x2F;strong&gt; - Population count (for Bloom filters, compression)&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;PCMPESTRI&#x2F;PCMPISTRI&lt;&#x2F;strong&gt; - String comparison (for text search)&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;asm&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-comment&quot;&gt;; SSE4.2 string processing and CRC&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;crc32&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt; eax&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-storage z-type&quot;&gt;byte&lt;&#x2F;span&gt;&lt;span&gt; [&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;rax&lt;&#x2F;span&gt;&lt;span&gt;]       &lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt;; CRC32 of single byte&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;crc32&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt; eax&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;ax&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt;               ; CRC32 of word&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;crc32&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt; eax&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;eax&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt;              ; Accumulate CRC32&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;popcnt&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt; rax&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;rbx&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt;             ; Population count (own CPUID flag, shipped alongside SSE4.2)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;pcmpestri&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt; xmm0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;xmm1&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;0x00&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt;  ; Packed compare explicit length strings&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-comment&quot;&gt;                            ; imm8 0x00: unsigned bytes, equal-any match; index returned in ecx&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;pcmpistri&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt; xmm0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;xmm1&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;0x04&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt;  ; Packed compare implicit length strings&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-comment&quot;&gt;                            ; imm8 0x04: unsigned bytes, range match; index returned in ecx&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The CRC32 instruction alone &lt;strong&gt;reduced checksum overhead
significantly&lt;&#x2F;strong&gt; for software built on CRC-32C, the variant it computes (Btrfs and iSCSI both use it), making storage operations notably faster.&lt;&#x2F;p&gt;
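&lt;p&gt;As a rough sketch of how the instruction gets used, here is a byte-at-a-time checksum
loop in NASM-style syntax. It assumes rsi points at the buffer and rcx holds its length
in bytes; the label names are made up. Keep in mind that the hardware polynomial is the
Castagnoli one, so the result will not match the CRC-32 used by zlib or Ethernet.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;asm&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-comment&quot;&gt;; Assumed inputs: rsi = buffer, rcx = length in bytes&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;    mov&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;   eax&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;0xFFFFFFFF&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt;   ; Conventional initial value&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;.next:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;    test&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;  rcx&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;rcx&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;    jz&lt;&#x2F;span&gt;&lt;span&gt;    .done&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt;             ; Nothing left to hash&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;    crc32&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt; eax&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-storage z-type&quot;&gt;byte&lt;&#x2F;span&gt;&lt;span&gt; [&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;rsi&lt;&#x2F;span&gt;&lt;span&gt;]   &lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt;; Fold one byte into the running CRC&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;    inc&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;   rsi&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;    dec&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;   rcx&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;    jmp&lt;&#x2F;span&gt;&lt;span&gt;   .next&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;.done:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;    not&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;   eax&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt;               ; Conventional final inversion&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Real implementations usually feed the instruction eight bytes at a time and only
fall back to single bytes for the tail, but the shape of the loop stays the same.&lt;&#x2F;p&gt;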
&lt;p&gt;The new instructions generated considerable discussion
in the developer community. One example is Austin Zhang of Intel, who
claimed: “After basic testing with iSCSI and confirmed that the iSCSI head
digest routines can be speeded up by 4x - 10x.”&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#27&quot;&gt;25&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Intel initially wanted to call SSE4.2 “SSE5” but AMD had already announced
SSE5 (with a different 3-operand format). This led to the confusing
naming that persists today, because nothing says “clear technical
vision” like having two companies use the same numbers for completely
different things.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;part-iii-the-birth-of-avx-2008-2011&quot;&gt;Part III: The Birth of AVX (2008-2011)&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;march-2008-the-announcement&quot;&gt;March 2008: The Announcement&lt;&#x2F;h3&gt;
&lt;p&gt;Intel officially announced AVX (then called “Gesher New Instructions”)
in &lt;strong&gt;March 2008&lt;&#x2F;strong&gt;. The codename “Gesher” means “bridge” in Hebrew;
it was later changed to “Sandy Bridge New Instructions” as the microarchitecture
name took precedence.&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#29&quot;&gt;26&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
&lt;p&gt;More details came through leaked slides in August 2008, which
revealed Intel’s roadmap including 8-core CPUs and the new AVX
instruction set.&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#30&quot;&gt;27&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt; Because nothing says “carefully planned
announcement” like your roadmap getting leaked to Engadget two months
early.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;why-256-bits&quot;&gt;Why 256 Bits?&lt;&#x2F;h3&gt;
&lt;p&gt;From Intel’s official documentation, &lt;strong&gt;three key factors drove the
256-bit decision&lt;&#x2F;strong&gt;:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Floating-Point Performance Doubling&lt;&#x2F;strong&gt;: The primary goal was to
double floating-point throughput for vectorizable workloads. Sandy
Bridge’s execution units were specifically reworked to achieve this.&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#31&quot;&gt;28&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Forward Scalability&lt;&#x2F;strong&gt;: As noted in Intel’s AVX introduction
documentation: &lt;em&gt;“Intel AVX is designed to support 512 or 1024 bits
in the future.”&lt;&#x2F;em&gt; The 256-bit design was explicitly chosen as a
stepping stone.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Manufacturing Reality&lt;&#x2F;strong&gt;: Moving to 256 bits was achievable on
Intel’s 32nm process without excessive die area penalties, while
512 bits would have required more significant architectural changes.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;This was Intel essentially saying: “256 bits is just the beginning.
Wait until you see what we’ve got planned.” Spoiler: what they had
planned was a fragmented nightmare that would make Linus Torvalds
do what he did best.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;the-three-operand-non-destructive-instruction-decision&quot;&gt;The Three-Operand Non-Destructive Instruction Decision&lt;&#x2F;h3&gt;
&lt;p&gt;The shift from destructive two-operand instructions (A = A + B) to
non-destructive three-operand instructions (C = A + B) addressed a
fundamental compiler and programmer pain point:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Previous SSE instructions&lt;&#x2F;strong&gt; (destructive):&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;asm&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;addps&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt; xmm0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;xmm1&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt;  ; xmm0 = xmm0 + xmm1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;&lt;strong&gt;AVX non-destructive&lt;&#x2F;strong&gt;:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;asm&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;vaddps&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt; xmm0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;xmm1&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;xmm2&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt;  ; xmm0 = xmm1 + xmm2&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;&lt;strong&gt;Why this mattered&lt;&#x2F;strong&gt;:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Reduced Register Spilling&lt;&#x2F;strong&gt;: Compilers no longer needed extra
copy instructions to preserve a value before a destructive operation overwrote it.
This was like finally getting a larger desk: you could actually spread out
your work instead of constantly shuffling papers.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Better Code Generation&lt;&#x2F;strong&gt;: Three-operand form enables more
efficient instruction scheduling. (Which is a significant step up from the Itanium
disaster.) The compiler could think ahead instead of constantly working
around the destructiveness of existing instructions.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Reduced Code Size&lt;&#x2F;strong&gt;: Though VEX encoding is more complex, avoiding
register copy operations often results in smaller overall code.&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#32&quot;&gt;29&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
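&lt;p&gt;To put numbers on the first and third points, here is a minimal sketch of
adding two vectors while keeping both inputs alive. The register choices are
arbitrary, and the byte counts assume the shortest encodings an assembler
would normally pick:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;asm&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-comment&quot;&gt;; SSE: addps overwrites its first operand, so copy it first&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;movaps&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt; xmm2&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;xmm0&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt;        ; xmm2 = xmm0         (0F 28 D0, 3 bytes)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;addps&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt; xmm2&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;xmm1&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt;         ; xmm2 = xmm0 + xmm1  (0F 58 D1, 3 bytes)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-comment&quot;&gt;; AVX: one instruction, both inputs untouched&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;vaddps&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt; xmm2&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;xmm0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;xmm1&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt;  ; xmm2 = xmm0 + xmm1  (C5 F8 58 D1, 4 bytes)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Two instructions and six bytes become one instruction and four bytes. The
saving is not guaranteed: some operand combinations need the 3-byte VEX
prefix, which costs one more byte.&lt;&#x2F;p&gt;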
&lt;p&gt;AVX removed artificial ISA constraints without abandoning dynamic OoO scheduling.&lt;&#x2F;p&gt;
&lt;p&gt;The “VEX encoding scheme” was introduced specifically to support this
three-operand format while maintaining backwards compatibility. Intel
basically invented a new instruction format that could still run old
code.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;asm&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-comment&quot;&gt;; VEX encoded AVX instructions (2- or 3-byte VEX prefix)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-comment&quot;&gt;; 3-operand non-destructive&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;vaddps&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt; ymm0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;ymm1&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;ymm2&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt;       ; YMM0 = YMM1 + YMM2&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-comment&quot;&gt;; Operands can overlap (source also destination)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;vaddps&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt; ymm1&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;ymm1&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;ymm2&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt;       ; YMM1 = YMM1 + YMM2&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-comment&quot;&gt;; Scalar operations using VEX&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;vsqrtss&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt; xmm0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;xmm1&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;xmm2&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt;      ; Scalar: xmm0[31:0] = sqrt(xmm2[31:0]), upper bits copied from xmm1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-comment&quot;&gt;; Memory operand with VEX&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;vaddps&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt; ymm0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;ymm1&lt;&#x2F;span&gt;&lt;span&gt;, [&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;rax&lt;&#x2F;span&gt;&lt;span&gt;+&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;256&lt;&#x2F;span&gt;&lt;span&gt;]  &lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt;; Load from memory&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h3 id=&quot;amd-s-bulldozer-influence&quot;&gt;AMD’s Bulldozer Influence&lt;&#x2F;h3&gt;
&lt;p&gt;&lt;strong&gt;May 2009&lt;&#x2F;strong&gt;: AMD announced they would support Intel’s AVX instructions&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;August 2010&lt;&#x2F;strong&gt;: AMD announced Bulldozer microarchitecture details&lt;&#x2F;p&gt;
&lt;p&gt;AMD had originally proposed SSE5 as its own 128-bit SIMD extension. It then
reworked that proposal into XOP (eXtended Operations), re-encoded to coexist
with AVX, and committed to supporting Intel’s AVX as well.&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#33&quot;&gt;30&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt; This
suggests AMD recognized Intel’s direction was gaining industry momentum.
Sometimes the best strategy is to stop fighting and join the party.&lt;&#x2F;p&gt;
&lt;p&gt;Intel’s aggressive 256-bit implementation in Sandy Bridge was widely
seen as a move to maintain SIMD leadership against AMD’s competing
designs. The message was clear: Intel wasn’t going to let AMD dictate
the future of x86 SIMD.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;target-workloads&quot;&gt;Target Workloads&lt;&#x2F;h3&gt;
&lt;p&gt;From Intel’s AVX introduction materials:&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#32&quot;&gt;29&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;High-Performance Computing (HPC)&lt;&#x2F;strong&gt;: Climate modeling, molecular
dynamics, quantum chemistry simulations&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Media and Entertainment&lt;&#x2F;strong&gt;: Video encoding&#x2F;decoding, image
processing, 3D rendering&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Scientific Computing&lt;&#x2F;strong&gt;: Finite element analysis, computational
fluid dynamics, seismic processing&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Signal Processing&lt;&#x2F;strong&gt;: Radar systems, communications systems,
medical imaging&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;Intel was explicitly targeting the workloads where GPUs were starting
to make inroads. The message was clear: you don’t need a graphics card
to do vector math. Just buy more Intel chips. (Spoiler: this didn’t
entirely work out as planned.)&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;part-iv-the-road-to-avx-512-2011-2016&quot;&gt;Part IV: The Road to AVX-512 (2011-2016)&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;the-fma-controversy-amd-vs-intel&quot;&gt;The FMA Controversy: AMD vs. Intel&lt;&#x2F;h3&gt;
&lt;p&gt;This was one of x86’s most bitter instruction set battles, the kind of
standards fight that makes engineers reach for the antacid:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;AMD’s Bulldozer (2011)&lt;&#x2F;strong&gt; introduced &lt;strong&gt;FMA4&lt;&#x2F;strong&gt; as a 4-operand instruction: &lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#35&quot;&gt;31&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;asm&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;vfmaddpd&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt; ymm0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;ymm1&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;ymm2&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;ymm3&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt; ; ymm0 = ymm1 * ymm2 + ymm3&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;&lt;strong&gt;Intel’s Haswell (2013)&lt;&#x2F;strong&gt; implemented &lt;strong&gt;FMA3&lt;&#x2F;strong&gt; as a 3-operand instruction:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;asm&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;vfmadd132pd&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt; ymm0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;ymm1&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;ymm2&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt; ; (dest = ymm0 * ymm2 + ymm1)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;FMA4 and FMA3 are incompatible extensions with different operand
counts and encodings.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;AMD’s Piledriver (2012)&lt;&#x2F;strong&gt; added FMA3 support while still keeping FMA4.&lt;&#x2F;p&gt;
&lt;p&gt;Bulldozer and its successors supported FMA4,
while Haswell and later Intel CPUs supported only FMA3. AMD later
dropped FMA4 support starting with its Zen families in favor of
FMA3, and FMA4 does not appear in current AMD CPUID-reported feature
flags.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;asm&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-comment&quot;&gt;; FMA3 (Intel, Haswell+) - fused multiply-add&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;vfmadd132ps&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt; ymm0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;ymm1&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;ymm2&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt;   ; ymm0 = ymm0 * ymm2 + ymm1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;vfmadd213ps&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt; ymm0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;ymm1&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;ymm2&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt;   ; ymm0 = ymm1 * ymm0 + ymm2&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;vfmadd231ps&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt; ymm0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;ymm1&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;ymm2&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt;   ; ymm0 = ymm1 * ymm2 + ymm0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-comment&quot;&gt;; FMA4 (AMD Bulldozer) - different operand mapping&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;vfmaddpd&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt; ymm0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;ymm1&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;ymm2&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;ymm3&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt; ; ymm0 = ymm1 * ymm2 + ymm3&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The market fragmentation meant developers had to use CPU-specific code
paths or risk crashes. Intel’s market dominance won; FMA4 died with
Bulldozer’s failure.&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#36&quot;&gt;32&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt; AMD eventually added FMA3 support in later
architectures, which is engineering-speak for “we were wrong, Intel
won, let’s just copy them.”&lt;&#x2F;p&gt;
&lt;p&gt;On a personal note, I much prefer the FMA4 syntax because it was non-destructive.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;the-technical-core-of-the-dispute&quot;&gt;The Technical Core of the Dispute&lt;&#x2F;h3&gt;
&lt;p&gt;The conflict wasn’t just about operand ordering; it was about register destruction.&lt;&#x2F;p&gt;
&lt;p&gt;Fused Multiply-Add requires three input values (A×B+C).
Storing the result in a fourth register (D) requires a
4-operand instruction.&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;AMD’s FMA4&lt;&#x2F;strong&gt; introduced a special extension to VEX allowing 4 distinct operands.
It was fully &lt;em&gt;non-destructive&lt;&#x2F;em&gt;.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Intel’s FMA3&lt;&#x2F;strong&gt; stuck to the standard VEX limit of 3 operands.
To make the math work, the destination register must also serve as one of the three inputs.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
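&lt;p&gt;The practical cost shows up whenever all three inputs are still needed
afterwards. A rough sketch, assuming the addend lives in ymm3 and has to
survive (the register choices are mine, purely for illustration):&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;asm&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-comment&quot;&gt;; FMA3: copy the addend first, then let the FMA overwrite the copy&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;vmovapd&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt; ymm0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;ymm3&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt;            ; ymm0 = ymm3&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;vfmadd231pd&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt; ymm0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;ymm1&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;ymm2&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt;  ; ymm0 = ymm1 * ymm2 + ymm0, ymm3 stays intact&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;FMA4 expresses the same thing in a single instruction. In accumulation
loops such as dot products the addend is overwritten on purpose, so the copy
disappears and FMA3 costs nothing extra there.&lt;&#x2F;p&gt;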
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;asm&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-comment&quot;&gt;; AMD FMA4 (Non-Destructive)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;vfmaddpd&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt; ymm0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;ymm1&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;ymm2&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;ymm3&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt;  ; ymm0 = ymm1 * ymm2 + ymm3&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-comment&quot;&gt;; All inputs (ymm1, ymm2, ymm3) are preserved.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-comment&quot;&gt;; Intel FMA3 (Destructive)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;vfmadd231pd&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt; ymm0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;ymm1&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;ymm2&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt;     ; ymm0 = ymm1 * ymm2 + ymm0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-comment&quot;&gt;; The original value of ymm0 is destroyed (used as the addend).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h3 id=&quot;the-xeon-phi-gpcore-connection&quot;&gt;The Xeon Phi “GPCORE” Connection&lt;&#x2F;h3&gt;
&lt;p&gt;The Xeon Phi’s core architecture (codenamed “GPCORE”) was a radical
departure from Intel’s mainstream cores. Designed by a separate team
working on the Larrabee research project, it featured:&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#37&quot;&gt;33&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Wide but shallow pipelines&lt;&#x2F;strong&gt; optimized for throughput over latency&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;512-bit vector units&lt;&#x2F;strong&gt; as the primary execution resource&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;No out-of-order execution&lt;&#x2F;strong&gt; in early versions (Knights Corner)&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h3 id=&quot;why-512-bits-the-xeon-phi-imperative&quot;&gt;Why 512 Bits? The Xeon Phi Imperative&lt;&#x2F;h3&gt;
&lt;p&gt;Intel’s drive to 512-bit vectors wasn’t primarily about mainstream
CPUs; it was about Xeon Phi and competing with GPUs in HPC. The Knights
Landing (KNL) project, announced at ISC 2014, was the first to implement
AVX-512, targeting 3+ TFLOPS of double-precision peak theoretical
performance per single node.&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#38&quot;&gt;34&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
&lt;p&gt;HPC customers represented a small percentage of Intel’s revenue
but demanded disproportionate engineering investment. One example is the
“Trinity” supercomputer for the NNSA (National Nuclear Security Administration),
a $174 million deal awarded to Cray that was to feature Haswell and Knights Landing.
&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#38&quot;&gt;34&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt; (Note: 13th slide)&lt;&#x2F;p&gt;
&lt;p&gt;This leads me to believe Intel sales teams used contracts like this to
justify AVX-512 development internally.
High-value enterprise customers often get special treatment,
even when they represent a tiny fraction of the overall market, population-wise…&lt;&#x2F;p&gt;
&lt;p&gt;I also miss affordable RAM.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;who-demanded-512-bit-simd&quot;&gt;Who Demanded 512-Bit SIMD?&lt;&#x2F;h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;National Labs&lt;&#x2F;strong&gt; (DOE) - Required for TOP500 supercomputer
competitiveness against NVIDIA GPUs&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Weather Modeling Agencies&lt;&#x2F;strong&gt; (NOAA, ECMWF) - Needed 2x+ vector
throughput for atmospheric simulations&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Quantitative Finance&lt;&#x2F;strong&gt; - HFT firms paying premium for any FP
performance edge&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Oil &amp;amp; Gas&lt;&#x2F;strong&gt; - Seismic processing workloads that were GPU-prohibitive
due to data transfer costs&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;These were the customers who would call Intel and say “we’ll give you
$50 million if you add this instruction.” And Intel, being a
corporation, would say “yes, absolutely, right away, here’s an entire
engineering team.”&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;part-v-the-avx-512-nightmare-2016-2026&quot;&gt;Part V: The AVX-512 Nightmare (2016-2026)&lt;&#x2F;h2&gt;
&lt;blockquote&gt;
&lt;p&gt;“To know your Enemy, you must become your Enemy.” &lt;br &#x2F;&gt;
    – Sun Tzu, The Art of War.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;h3 id=&quot;the-power-virus-reality&quot;&gt;The Power Virus Reality&lt;&#x2F;h3&gt;
&lt;p&gt;&lt;strong&gt;Travis Downs’&lt;&#x2F;strong&gt; detailed analysis revealed that AVX-512 on Skylake-X
caused massive &lt;strong&gt;license-based downclocking&lt;&#x2F;strong&gt;.&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#39&quot;&gt;35&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;License Level&lt;&#x2F;th&gt;&lt;th&gt;Base Frequency&lt;&#x2F;th&gt;&lt;th&gt;Notes&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;L0 (Non-AVX)&lt;&#x2F;td&gt;&lt;td&gt;3.2 GHz&lt;&#x2F;td&gt;&lt;td&gt;Standard operation&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;L1 (AVX)&lt;&#x2F;td&gt;&lt;td&gt;2.8 GHz&lt;&#x2F;td&gt;&lt;td&gt;12.5% reduction&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;L2 (AVX-512)&lt;&#x2F;td&gt;&lt;td&gt;2.4 GHz&lt;&#x2F;td&gt;&lt;td&gt;25% reduction from base&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;&lt;strong&gt;The thermal&#x2F;power calculus&lt;&#x2F;strong&gt;: 512-bit SIMD units consumed approximately
&lt;strong&gt;3x the power&lt;&#x2F;strong&gt; of 256-bit units at the same frequency. Intel had to
either:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Downclock&lt;&#x2F;strong&gt; when 512-bit instructions executed (their choice)&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Increase TDP&lt;&#x2F;strong&gt; significantly (unacceptable for mainstream)&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Disable cores&lt;&#x2F;strong&gt; to maintain power budget (theoretical, never
implemented)&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;They chose option 1, which meant your $500 processor would deliberately
slow itself down if you even dared to use its most advanced features.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;linus-torvalds-famous-rant-july-2020&quot;&gt;Linus Torvalds’ Famous Rant (July 2020)&lt;&#x2F;h3&gt;
&lt;p&gt;Linus Torvalds, the creator of Linux, is not known for holding back. In
July 2020, he delivered one of the great tech rants of all time:&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#40&quot;&gt;36&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;“I want my power limits to be reached with regular integer code, not
with some AVX512 power virus that takes away top frequency (because
people ended up using it for memcpy!) and takes away cores (because
those useless garbage units take up space).”&lt;&#x2F;p&gt;
&lt;p&gt;“I hope AVX512 dies a painful death, and that Intel starts fixing
real problems instead of trying to create magic instructions to then
create benchmarks that they can look good on.”&lt;&#x2F;p&gt;
&lt;p&gt;“I’d much rather see that transistor budget used on other things
that are much more relevant. Even if it’s still FP math (in the GPU,
rather than AVX512). Or just give me more cores (with good
single-thread performance, but without the garbage like AVX512) like
AMD did.”&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;h3 id=&quot;the-avx-512-fragmentation-problem&quot;&gt;The AVX-512 Fragmentation Problem&lt;&#x2F;h3&gt;
&lt;p&gt;AVX-512 became a “family of instruction sets” rather than a single
standard:&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#42&quot;&gt;37&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Knights Landing (2016)&lt;&#x2F;strong&gt;: AVX-512-F, CD, ER, PF (ER and PF never reached mainstream CPUs)&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Skylake-X (2017)&lt;&#x2F;strong&gt;: AVX-512-F, CD, BW, DQ, VL&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Cannon Lake (2018)&lt;&#x2F;strong&gt;: Added AVX-512-IFMA and VBMI&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Ice Lake (2019)&lt;&#x2F;strong&gt;: Better frequency scaling, added VNNI (AI&#x2F;ML instructions); BF16 followed with Cooper Lake (2020)&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Alder Lake (2022)&lt;&#x2F;strong&gt;: &lt;strong&gt;Disabled entirely&lt;&#x2F;strong&gt; due to hybrid architecture
conflicts&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;It got to the point where you couldn’t even tell which AVX-512 features
a processor supported without looking at the spec sheet.
Intel had essentially created an instruction set that was different on every chip.
This is the opposite of standardization.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;https:&#x2F;&#x2F;imgs.xkcd.com&#x2F;comics&#x2F;standards.png&quot; alt=&quot;mandatory xkcd standards&quot; &#x2F;&gt;   &lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#41&quot;&gt;38&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
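&lt;p&gt;Short of reading the spec sheet, the only reliable way to know is to ask
the CPU at runtime. Below is a minimal sketch in NASM-style syntax. The labels
are hypothetical, the fragment assumes the surrounding code preserves rbx, and
the bit positions are the ones documented for CPUID and XCR0 in Intel’s
Software Developer’s Manual:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;asm&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-comment&quot;&gt;; Assumption: rbx is saved elsewhere (cpuid clobbers eax, ebx, ecx, edx)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-comment&quot;&gt;; Assumption: the .no_* labels are hypothetical fallback paths&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;mov&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt; eax&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;cpuid&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt;                 ; leaf 1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;bt&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt; ecx&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;27&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt;            ; OSXSAVE: the OS uses XSAVE, so XGETBV is usable&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;jnc&lt;&#x2F;span&gt;&lt;span&gt; .no_avx512&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;xor&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt; ecx&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;ecx&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;xgetbv&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt;                ; edx:eax = XCR0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;and&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt; eax&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;0xE6&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt;         ; SSE, AVX, opmask, ZMM_Hi256, Hi16_ZMM state bits&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;cmp&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt; eax&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;0xE6&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;jne&lt;&#x2F;span&gt;&lt;span&gt; .no_avx512&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt;       ; the OS does not save the 512-bit state&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;mov&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt; eax&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;7&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;xor&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt; ecx&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;ecx&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;cpuid&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt;                 ; leaf 7, subleaf 0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;bt&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt; ebx&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;16&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt;            ; AVX512F (foundation)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;jnc&lt;&#x2F;span&gt;&lt;span&gt; .no_avx512&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;bt&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt; ebx&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;30&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt;            ; AVX512BW&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;jnc&lt;&#x2F;span&gt;&lt;span&gt; .no_avx512bw&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;bt&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt; ecx&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;11&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt;            ; AVX512_VNNI&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;jnc&lt;&#x2F;span&gt;&lt;span&gt; .no_vnni&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Every extension (DQ, CD, VL, IFMA, VBMI and so on) gets its own bit and its
own branch, which is exactly the problem.&lt;&#x2F;p&gt;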
&lt;h3 id=&quot;why-alder-lake-killed-it&quot;&gt;Why Alder Lake Killed It&lt;&#x2F;h3&gt;
&lt;p&gt;Intel’s hybrid architecture had Performance-cores (Golden Cove) and
Efficiency-cores (Gracemont). Only P-cores had 512-bit units; E-cores
maxed out at 256-bit. This caused:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Scheduling nightmares&lt;&#x2F;strong&gt; for the OS scheduler and Intel’s Thread Director&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Power management conflicts&lt;&#x2F;strong&gt; between core types&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Customer confusion&lt;&#x2F;strong&gt; over which instructions would work where&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Intel’s solution: &lt;strong&gt;Fuse it off in silicon&lt;&#x2F;strong&gt; to prevent BIOS workarounds,
then create AVX10 as a unified replacement.&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#43&quot;&gt;39&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt; This is what happens
when you build a feature so complex that even the company that created
it can’t figure out how to make it work across different product lines.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;why-amd-resisted-and-how-they-finally-won-for-now&quot;&gt;Why AMD Resisted (And How They Finally Won, for now)&lt;&#x2F;h3&gt;
&lt;blockquote&gt;
&lt;p&gt;AMD’s position (&lt;em&gt;2017-2021&lt;&#x2F;em&gt;): &lt;br &#x2F;&gt;
“We’re not rushing to add features that make Intel’s chips throttle.” &lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#45&quot;&gt;40&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The Zen 4 Breakthrough (&lt;em&gt;2022&lt;&#x2F;em&gt;)&lt;&#x2F;strong&gt;: When AMD finally added AVX-512 in Ryzen 7000,
they did it with a stroke of genius: “double-pumping.”
Instead of building massive 512-bit execution units that generated enormous heat,
they executed 512-bit instructions using two cycles on their existing 256-bit units.&lt;&#x2F;p&gt;
&lt;p&gt;It was simply logical. Developers got the instruction set support they wanted
(&lt;em&gt;VNNI&lt;&#x2F;em&gt;, &lt;em&gt;BFloat16&lt;&#x2F;em&gt;), but the processors didn’t downclock. This approach
avoided the “garbage” power penalties that had plagued Intel’s implementation.&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#46&quot;&gt;41&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Zen 5’s Power Play (&lt;em&gt;2024&lt;&#x2F;em&gt;)&lt;&#x2F;strong&gt;: With Ryzen 9000, AMD finally moved to a true full-width
512-bit datapath. This doubled raw throughput, but it brought the laws
of physics back into play: lighting up a 512-bit wire simply generates more heat
than a 256-bit one. Although it avoided the catastrophic downclocking of Intel’s
Skylake era, it forced AMD to manage power density much more aggressively than
with Zen 4.&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#46&quot;&gt;41&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
&lt;h3 id=&quot;raja-koduri-s-defense-august-2020&quot;&gt;Raja Koduri’s Defense (August 2020)&lt;&#x2F;h3&gt;
&lt;blockquote&gt;
&lt;p&gt;“There are people who are doing real work with AVX-512. It’s not just
benchmarks. And it’s not going away.”&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#47&quot;&gt;42&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;Intel’s Raja Koduri (who had joined Intel in 2017 after stints
at Apple and AMD) tried to defend AVX-512 against Torvalds’
criticism. The subtext seemed to be: “Linus, you don’t understand.
National labs and AI researchers actually use this stuff!”&lt;&#x2F;p&gt;
&lt;p&gt;Linus’ response was not diplomatic, but it was memorable.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;the-2022-resolution-intel-finally-surrenders&quot;&gt;The 2022 Resolution: Intel Finally Surrenders&lt;&#x2F;h3&gt;
&lt;p&gt;In January 2022, the debate reached its inevitable conclusion. Intel
disabled AVX-512 on Alder Lake processors, not through a BIOS option,
but by fusing it off in silicon. The official rationale was hybrid
architecture conflicts: Performance-cores had 512-bit units while
Efficiency-cores maxed at 256-bit, creating scheduling nightmares for
the OS thread director.&lt;&#x2F;p&gt;
&lt;p&gt;But the subtext was clear: Linus &amp;amp; common sense had won. The “power virus” that
downclocked entire processor lineages, the transistor budget consumed by
features most developers never used or didn’t even know the names of,
the fragmentation across SKUs, all of it was quietly retired.&lt;&#x2F;p&gt;
&lt;p&gt;As Linus noted in November 2022, neural net inference is one of the
few legitimate use cases for AVX-512. For everything else, from
video encoding to database operations to general-purpose computing,
the costs outweighed the benefits.&lt;&#x2F;p&gt;
&lt;p&gt;And in an age where literally every personal computer has either
an integrated GPU or a discrete GPU, CPU SIMD seems like a weird
transitional phase. It definitely has its uses, but it is often applied in
contexts where its costs outweigh its benefits.&lt;&#x2F;p&gt;
&lt;p&gt;The resolution wasn’t a technical decision. It was a market decision.
Intel’s hybrid architecture demanded coherent vector support across
all cores. AVX-512 couldn’t provide that. So it is being slowly removed.
Just like Itanium, x87 and the x86 we used to know.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;the-fragmentation-spiral&quot;&gt;The Fragmentation Spiral&lt;&#x2F;h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;2013-2016&lt;&#x2F;strong&gt;: Intel splits AVX-512 across incompatible implementations&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;2017-2021&lt;&#x2F;strong&gt;: Different SKUs have different feature subsets
(bifurcation strategy)&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;2022&lt;&#x2F;strong&gt;: Alder Lake fuses off AVX-512 entirely&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;2023&lt;&#x2F;strong&gt;: Intel announces AVX10 to unify the mess &lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#48&quot;&gt;43&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;2026&lt;&#x2F;strong&gt;: Nova Lake with AVX10.2 targets coherent 512-bit support across
all cores (confirmed November 2025)&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;The irony is that AVX-512 was designed to unify Intel’s vector strategy.
Instead, it became the most fragmented instruction set extension in
x86-64 history, requiring multiple replacement specifications to fix the
damage. This is the equivalent of creating a solution so complex that
you need a new solution just to solve the original solution’s
problems.&lt;&#x2F;p&gt;
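&lt;p&gt;For software, this fragmentation means AVX-512 can never be assumed; it has to be checked at run time. A minimal sketch, assuming GCC or Clang on x86 (&lt;code&gt;__builtin_cpu_supports&lt;&#x2F;code&gt; is a compiler builtin, not portable C, and the helper name &lt;code&gt;pick_vector_width&lt;&#x2F;code&gt; is made up for illustration):&lt;&#x2F;p&gt;

```c
/* Returns the widest vector width (in bits) this sketch would dispatch to.
 * Hypothetical helper: real dispatchers usually test several AVX-512
 * subsets (F, BW, VL, ...) instead of only the foundation set. */
static int pick_vector_width(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();                      /* populate the feature cache */
    if (__builtin_cpu_supports("avx512f")) return 512;
    if (__builtin_cpu_supports("avx2"))    return 256;
    if (__builtin_cpu_supports("sse2"))    return 128;
    return 0;                                  /* no usable SIMD detected */
#else
    return 0;                                  /* non-x86 or other compiler: scalar path */
#endif
}
```

&lt;p&gt;On an Alder Lake part with AVX-512 fused off, a check like this would be expected to fall through to the 256-bit path.&lt;&#x2F;p&gt;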
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;lessons-learned&quot;&gt;Lessons Learned&lt;&#x2F;h2&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Backward compatibility drives architecture&lt;&#x2F;strong&gt;: The register
aliasing decision haunted MMX for years, but it enabled rapid
adoption without OS changes.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Marketing matters as much as engineering&lt;&#x2F;strong&gt;: Intel’s aggressive
MMX marketing, despite modest real-world gains, established SIMD
as essential for consumer processors.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Competition accelerates innovation&lt;&#x2F;strong&gt;: AMD’s 3DNow! forced Intel
to add FP SIMD capabilities years earlier than planned. The FMA
controversy showed how fragmented standards hurt developers.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Compromises become permanent&lt;&#x2F;strong&gt;: Intel’s “sorta” 128-bit SSE
implementation influenced x86 SIMD architecture for a decade.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Customer requirements can override engineering sanity&lt;&#x2F;strong&gt;: AVX-512
was pushed by a small percentage of customers but created massive
fragmentation and power issues for everyone.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Fragmentation has costs&lt;&#x2F;strong&gt;: AVX-512’s bifurcation across SKUs and
eventual disablement in hybrid architectures shows the danger of
over-engineering for edge cases.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Sometimes the market decides&lt;&#x2F;strong&gt;: AMD won the FMA fight not through
technical superiority, but through market dominance. The best
instruction set is the one everyone actually uses.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;the-legacy&quot;&gt;The Legacy&lt;&#x2F;h2&gt;
&lt;p&gt;The engineers who built x86 SIMD made decisions that shaped computing
for decades, often under intense pressure and uncertainty. Their legacy
is in every video encode, 3D render, AI inference, and scientific
simulation happening on x86 processors today.&lt;&#x2F;p&gt;
&lt;p&gt;The battle continues with AVX10, but the lessons from MMX through
AVX-512 remain: &lt;strong&gt;architecture decisions made in conference rooms in
&lt;span class=&quot;map-inline&quot;&gt;
    &lt;button
        class=&quot;map-trigger&quot;
        type=&quot;button&quot;
        data-lat=&quot;32.7940&quot;
        data-lon=&quot;34.9896&quot;
        data-zoom=&quot;10&quot;
        data-title=&quot;Haifa, Israel&quot;
    &gt;
        Haifa
    &lt;&#x2F;button&gt;

    &lt;span class=&quot;map-popover&quot;&gt;
        &lt;span class=&quot;map-header&quot;&gt;Haifa, Israel&lt;&#x2F;span&gt;
        &lt;span class=&quot;map&quot;&gt;&lt;&#x2F;span&gt;
    &lt;&#x2F;span&gt;
&lt;&#x2F;span&gt;
, &lt;span class=&quot;map-inline&quot;&gt;
    &lt;button
        class=&quot;map-trigger&quot;
        type=&quot;button&quot;
        data-lat=&quot;37.3541&quot;
        data-lon=&quot;-121.9552&quot;
        data-zoom=&quot;11&quot;
        data-title=&quot;Santa Clara, CA&quot;
    &gt;
        Santa Clara
    &lt;&#x2F;button&gt;

    &lt;span class=&quot;map-popover&quot;&gt;
        &lt;span class=&quot;map-header&quot;&gt;Santa Clara, CA&lt;&#x2F;span&gt;
        &lt;span class=&quot;map&quot;&gt;&lt;&#x2F;span&gt;
    &lt;&#x2F;span&gt;
&lt;&#x2F;span&gt;
, and &lt;span class=&quot;map-inline&quot;&gt;
    &lt;button
        class=&quot;map-trigger&quot;
        type=&quot;button&quot;
        data-lat=&quot;30.2672&quot;
        data-lon=&quot;-97.7431&quot;
        data-zoom=&quot;10&quot;
        data-title=&quot;Austin, TX&quot;
    &gt;
        Austin
    &lt;&#x2F;button&gt;

    &lt;span class=&quot;map-popover&quot;&gt;
        &lt;span class=&quot;map-header&quot;&gt;Austin, TX&lt;&#x2F;span&gt;
        &lt;span class=&quot;map&quot;&gt;&lt;&#x2F;span&gt;
    &lt;&#x2F;span&gt;
&lt;&#x2F;span&gt;
 echo through decades of computing&lt;&#x2F;strong&gt;. The
next chapter is being written now: will AVX10 finally unify Intel’s
fractured vector strategy, or will history repeat itself?&lt;&#x2F;p&gt;
&lt;p&gt;One thing is certain: somewhere, right now, an engineer is making a
decision that will seem brilliant, stupid, or utterly incomprehensible
to programmers thirty years from now. That’s the nature of this
business. And honestly? That’s what makes it fun.&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;“I’d much rather see that transistor budget used on other things
that are much more relevant.”&lt;&#x2F;em&gt; &lt;br &#x2F;&gt;
    – Linus Torvalds, 2020 &lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#40&quot;&gt;36&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;appendix-a-x86-simd-syntax-reference&quot;&gt;Appendix A: x86 SIMD Syntax Reference&lt;&#x2F;h2&gt;
&lt;p&gt;This appendix summarizes, to the best of my ability, the documents which can be found &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.intel.com&#x2F;content&#x2F;www&#x2F;us&#x2F;en&#x2F;developer&#x2F;articles&#x2F;technical&#x2F;intel-sdm.htm&quot;&gt;here&lt;&#x2F;a&gt;.
If you see any problems here, please don’t hesitate to contact me.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;a-1-register-naming-conventions&quot;&gt;A.1 Register Naming Conventions&lt;&#x2F;h3&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Extension&lt;&#x2F;th&gt;&lt;th&gt;Registers&lt;&#x2F;th&gt;&lt;th&gt;Width&lt;&#x2F;th&gt;&lt;th&gt;Naming Scheme&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;MMX&lt;&#x2F;td&gt;&lt;td&gt;MM0-MM7&lt;&#x2F;td&gt;&lt;td&gt;64-bit&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;MM&amp;lt;n&amp;gt;&lt;&#x2F;code&gt; where n = 0-7&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;SSE&lt;&#x2F;td&gt;&lt;td&gt;XMM0-XMM15&lt;&#x2F;td&gt;&lt;td&gt;128-bit&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;XMM&amp;lt;n&amp;gt;&lt;&#x2F;code&gt; where n = 0-15 (0-7 in 32-bit mode)&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;AVX&lt;&#x2F;td&gt;&lt;td&gt;YMM0-YMM15&lt;&#x2F;td&gt;&lt;td&gt;256-bit&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;YMM&amp;lt;n&amp;gt;&lt;&#x2F;code&gt; (lower 256 bits of ZMM)&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;AVX-512&lt;&#x2F;td&gt;&lt;td&gt;ZMM0-ZMM31&lt;&#x2F;td&gt;&lt;td&gt;512-bit&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;ZMM&amp;lt;n&amp;gt;&lt;&#x2F;code&gt; (full register)&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
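&lt;p&gt;The aliasing between the three register names can be pictured as a C union. This is only a mental model under stated assumptions: the type &lt;code&gt;vec_reg&lt;&#x2F;code&gt; and the helper &lt;code&gt;write_xmm_vex&lt;&#x2F;code&gt; are hypothetical, and the helper mimics the documented behaviour of VEX-encoded instructions, which zero every bit above the 128 they write.&lt;&#x2F;p&gt;

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of one vector register: XMMn and YMMn are views
 * of the low bytes of ZMMn, not separate storage. */
typedef union {
    uint8_t zmm[64];   /* full 512-bit register            */
    uint8_t ymm[32];   /* low 256 bits of the same storage */
    uint8_t xmm[16];   /* low 128 bits of the same storage */
} vec_reg;

/* Emulates a VEX-encoded write to XMMn: the low 128 bits receive the
 * data and all bits above them are cleared. */
static void write_xmm_vex(vec_reg *r, const uint8_t src[16])
{
    assert(r != NULL && src != NULL);
    memset(r->zmm, 0, sizeof r->zmm);   /* upper bits are zeroed      */
    memcpy(r->xmm, src, 16);            /* low 128 bits take the data */
}
```

&lt;p&gt;Legacy (non-VEX) SSE instructions behave differently: they leave the upper bits untouched, which this sketch does not model.&lt;&#x2F;p&gt;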
&lt;h3 id=&quot;a-2-instruction-suffix-encoding&quot;&gt;A.2 Instruction Suffix Encoding&lt;&#x2F;h3&gt;
&lt;p&gt;The instruction suffix encodes the data type and operation:&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Suffix&lt;&#x2F;th&gt;&lt;th&gt;Meaning&lt;&#x2F;th&gt;&lt;th&gt;Example&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;S&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;Signed integer&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;PMOVSXBD&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;U&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;Unsigned integer&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;PADDUSB&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;B&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;Byte (8-bit)&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;PADDB&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;W&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;Word (16-bit)&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;PADDW&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;D&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;Doubleword (32-bit)&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;PADDD&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;Q&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;Quadword (64-bit)&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;PADDQ&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;S&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;Single-precision FP&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;ADDPS&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;D&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;Double-precision FP&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;ADDPD&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;h3 id=&quot;a-3-assembly-syntax-variations&quot;&gt;A.3 Assembly Syntax Variations&lt;&#x2F;h3&gt;
&lt;p&gt;&lt;strong&gt;Intel Syntax&lt;&#x2F;strong&gt; (used throughout this document):&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;asm&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;vaddps&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt; zmm0&lt;&#x2F;span&gt;&lt;span&gt; {&lt;&#x2F;span&gt;&lt;span class=&quot;z-storage&quot;&gt;k1&lt;&#x2F;span&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;span&gt;{&lt;&#x2F;span&gt;&lt;span class=&quot;z-storage&quot;&gt;z&lt;&#x2F;span&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;zmm1&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;zmm2&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt;    ; ZMM0 = ZMM1 + ZMM2 with masking&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;vmovups&lt;&#x2F;span&gt;&lt;span&gt; zmmword ptr [&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;rax&lt;&#x2F;span&gt;&lt;span&gt;], &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;zmm3&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt;    ; Store packed singles&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;&lt;strong&gt;AT&amp;amp;T Syntax&lt;&#x2F;strong&gt; (GNU assembler):&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;asm&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;vaddps&lt;&#x2F;span&gt;&lt;span&gt; %&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;zmm2&lt;&#x2F;span&gt;&lt;span&gt;, %&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;zmm1&lt;&#x2F;span&gt;&lt;span&gt;, %&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;zmm0&lt;&#x2F;span&gt;&lt;span&gt;{%k1}&lt;&#x2F;span&gt;&lt;span&gt;{&lt;&#x2F;span&gt;&lt;span class=&quot;z-storage&quot;&gt;z&lt;&#x2F;span&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;vmovups&lt;&#x2F;span&gt;&lt;span&gt; %&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;zmm3&lt;&#x2F;span&gt;&lt;span&gt;, (%&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;rax&lt;&#x2F;span&gt;&lt;span&gt;)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;h3 id=&quot;a-4-evex-vex-encoding-fields&quot;&gt;A.4 EVEX&#x2F;VEX Encoding Fields&lt;&#x2F;h3&gt;
&lt;p&gt;AVX-512 uses the EVEX encoding: a four-byte prefix made of the escape byte &lt;code&gt;0x62&lt;&#x2F;code&gt; followed by three payload bytes. The main fields are:&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Field&lt;&#x2F;th&gt;&lt;th&gt;Bits&lt;&#x2F;th&gt;&lt;th&gt;Purpose&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;pp&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;2&lt;&#x2F;td&gt;&lt;td&gt;Compressed legacy prefix (00 = none, 01 = 66, 10 = F3, 11 = F2)&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;mm&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;2&lt;&#x2F;td&gt;&lt;td&gt;Opcode map select (VEX.mmmmm equivalent)&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;W&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;td&gt;Operand size &#x2F; opcode extension (typically 32- vs 64-bit elements)&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;L&amp;#39;L&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;2&lt;&#x2F;td&gt;&lt;td&gt;Vector length (00 = 128, 01 = 256, 10 = 512)&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;vvvv&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;4&lt;&#x2F;td&gt;&lt;td&gt;Non-destructive source register specifier (stored inverted)&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;aaa&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;3&lt;&#x2F;td&gt;&lt;td&gt;Opmask register specifier (000 = no masking)&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;z&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;td&gt;Zeroing-masking (1) vs. merging-masking (0)&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;b&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;td&gt;Broadcast &#x2F; rounding control &#x2F; SAE&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;R&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;td&gt;Extension of the ModRM.reg register specifier (stored inverted)&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;In simplified form, the prefix is the concatenation of the escape byte and these fields:&lt;&#x2F;p&gt;
&lt;p&gt;$$
\text{EVEX} = \text{0x62} \;\Vert\; \text{RR}{}^\prime\text{B} \;\Vert\;
\text{vvvv} \;\Vert\; \text{aaa}
$$&lt;&#x2F;p&gt;
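&lt;p&gt;To make the packing concrete, here is a sketch of an encoder for the four prefix bytes. It assumes the bit layout given in the Intel SDM (P0 = R X B R’ 0 0 m m, P1 = W vvvv 1 pp, P2 = z L’L b V’ aaa) and handles register-to-register operands only. The names &lt;code&gt;evex_fields&lt;&#x2F;code&gt; and &lt;code&gt;evex_encode&lt;&#x2F;code&gt; are illustrative, not part of any real assembler API.&lt;&#x2F;p&gt;

```c
#include <assert.h>
#include <stdint.h>

/* Field values for one EVEX prefix. Register numbers are given in their
 * natural form; the encoder inverts them where the format requires it. */
typedef struct {
    unsigned reg;   /* ModRM.reg register, 0-31   -> R and R'          */
    unsigned rm;    /* ModRM.rm register, 0-31    -> B and X           */
    unsigned vvvv;  /* extra source register 0-31 -> vvvv and V'       */
    unsigned mm;    /* opcode map: 1 = 0F, 2 = 0F38, 3 = 0F3A          */
    unsigned pp;    /* legacy prefix: 0 = none, 1 = 66, 2 = F3, 3 = F2 */
    unsigned w;     /* EVEX.W                                          */
    unsigned ll;    /* vector length: 0 = 128, 1 = 256, 2 = 512        */
    unsigned aaa;   /* opmask register k0-k7 (0 = no masking)          */
    unsigned z;     /* 1 = zeroing-masking, 0 = merging-masking        */
    unsigned b;     /* broadcast / rounding-control bit                */
} evex_fields;

/* Writes the escape byte and the three payload bytes into out[0..3]. */
static void evex_encode(const evex_fields *f, uint8_t out[4])
{
    assert(f->reg < 32 && f->rm < 32 && f->vvvv < 32);
    assert(f->mm < 4 && f->pp < 4 && f->ll < 3 && f->aaa < 8);

    unsigned r  = (f->reg  >> 3) & 1, rp = (f->reg >> 4) & 1;
    unsigned b  = (f->rm   >> 3) & 1, x  = (f->rm  >> 4) & 1;
    unsigned vp = (f->vvvv >> 4) & 1;

    out[0] = 0x62;
    /* P0: inverted R, X, B, R', two reserved zero bits, then mm */
    out[1] = (uint8_t)(((r ^ 1) << 7) | ((x ^ 1) << 6) | ((b ^ 1) << 5) |
                       ((rp ^ 1) << 4) | f->mm);
    /* P1: W, inverted low four bits of vvvv, a fixed 1 bit, then pp */
    out[2] = (uint8_t)(((f->w & 1) << 7) | ((~f->vvvv & 0xF) << 3) |
                       (1u << 2) | f->pp);
    /* P2: z, L'L, b, inverted V', then aaa */
    out[3] = (uint8_t)(((f->z & 1) << 7) | (f->ll << 5) | ((f->b & 1) << 4) |
                       ((vp ^ 1) << 3) | f->aaa);
}
```

&lt;p&gt;With reg = 0, rm = 2, vvvv = 1, mm = 1 and ll = 2, this yields &lt;code&gt;62 F1 74 48&lt;&#x2F;code&gt;, the prefix of &lt;code&gt;vaddps zmm0, zmm1, zmm2&lt;&#x2F;code&gt; (the opcode &lt;code&gt;58&lt;&#x2F;code&gt; and ModRM byte &lt;code&gt;C2&lt;&#x2F;code&gt; follow). Setting aaa = 1 and z = 1 changes only the last prefix byte, to &lt;code&gt;C9&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;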
&lt;h3 id=&quot;a-5-intrinsic-type-mappings&quot;&gt;A.5 Intrinsic Type Mappings&lt;&#x2F;h3&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;C&#x2F;C++ Intrinsic Type&lt;&#x2F;th&gt;&lt;th&gt;Extension&lt;&#x2F;th&gt;&lt;th&gt;Width (bits)&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;__m64&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;MMX&lt;&#x2F;td&gt;&lt;td&gt;64&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;__m128&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;SSE&lt;&#x2F;td&gt;&lt;td&gt;128&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;__m128d&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;SSE (double)&lt;&#x2F;td&gt;&lt;td&gt;128&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;__m256&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;AVX&lt;&#x2F;td&gt;&lt;td&gt;256&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;__m256d&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;AVX (double)&lt;&#x2F;td&gt;&lt;td&gt;256&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;__m512&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;AVX-512&lt;&#x2F;td&gt;&lt;td&gt;512&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;__m512d&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;AVX-512 (double)&lt;&#x2F;td&gt;&lt;td&gt;512&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
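&lt;p&gt;As a rough illustration of how these types are used, the sketch below adds two float arrays four elements at a time with &lt;code&gt;__m128&lt;&#x2F;code&gt;. The function name &lt;code&gt;add_f32&lt;&#x2F;code&gt; is hypothetical, and the &lt;code&gt;__SSE__&lt;&#x2F;code&gt; guard is an assumption about the compiler (GCC and Clang define it); without it the code falls back to a plain scalar loop.&lt;&#x2F;p&gt;

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>   /* __m128 and the _mm_*_ps intrinsics */
#endif

/* Adds two float arrays element-wise. Processes four floats per
 * iteration when SSE is available, one at a time otherwise. */
static void add_f32(float *dest, const float *a, const float *b, size_t n)
{
    assert(dest != NULL && a != NULL && b != NULL);
    size_t i = 0;
#if defined(__SSE__)
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);             /* unaligned 128-bit load */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(dest + i, _mm_add_ps(va, vb)); /* four adds in one go    */
    }
#endif
    for (; i < n; ++i)                               /* scalar tail or fallback */
        dest[i] = a[i] + b[i];
}
```

&lt;p&gt;The 256-bit and 512-bit types follow the same pattern with the &lt;code&gt;_mm256_&lt;&#x2F;code&gt; and &lt;code&gt;_mm512_&lt;&#x2F;code&gt; prefixes, but they need the matching compiler flags and hardware support.&lt;&#x2F;p&gt;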
&lt;h3 id=&quot;a-6-common-operation-mnemonics&quot;&gt;A.6 Common Operation Mnemonics&lt;&#x2F;h3&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Category&lt;&#x2F;th&gt;&lt;th&gt;Instructions&lt;&#x2F;th&gt;&lt;th&gt;Description&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Arithmetic&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;PADD*&lt;&#x2F;code&gt;, &lt;code&gt;PSUB*&lt;&#x2F;code&gt;, &lt;code&gt;PMUL*&lt;&#x2F;code&gt;, &lt;code&gt;PMADD*&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;Integer arithmetic&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;FP Arithmetic&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;ADDPS&lt;&#x2F;code&gt;, &lt;code&gt;MULPS&lt;&#x2F;code&gt;, &lt;code&gt;DIVPS&lt;&#x2F;code&gt;, &lt;code&gt;SQRTPS&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;Single-precision FP&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Compare&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;PCMPEQ*&lt;&#x2F;code&gt;, &lt;code&gt;PCMPGT*&lt;&#x2F;code&gt;, &lt;code&gt;CMPPS&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;Equality&#x2F;greater-than&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Logical&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;PAND&lt;&#x2F;code&gt;, &lt;code&gt;POR&lt;&#x2F;code&gt;, &lt;code&gt;PXOR&lt;&#x2F;code&gt;, &lt;code&gt;ANDPS&lt;&#x2F;code&gt;, &lt;code&gt;ORPS&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;Bitwise operations&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Shuffle&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;PSHUFLW&lt;&#x2F;code&gt;, &lt;code&gt;SHUFPS&lt;&#x2F;code&gt;, &lt;code&gt;VPERM*&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;Lane manipulation&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Load&#x2F;Store&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;MOVAPS&lt;&#x2F;code&gt;, &lt;code&gt;MOVUPD&lt;&#x2F;code&gt;, &lt;code&gt;VBROADCAST*&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;Memory transfers&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Convert&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;CVTDQ2PS&lt;&#x2F;code&gt;, &lt;code&gt;CVTPS2DQ&lt;&#x2F;code&gt;, &lt;code&gt;VCVT*&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;Type conversion&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Mask&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;KAND&lt;&#x2F;code&gt;, &lt;code&gt;KOR&lt;&#x2F;code&gt;, &lt;code&gt;KXNOR&lt;&#x2F;code&gt;, &lt;code&gt;KNOT&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;Mask register ops&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;h3 id=&quot;a-7-mask-register-operations&quot;&gt;A.7 Mask Register Operations&lt;&#x2F;h3&gt;
&lt;p&gt;AVX-512 introduced dedicated mask registers (k0-k7):&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;asm&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-comment&quot;&gt;; No mask: every lane is written (k0 cannot be used as a write mask)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;vaddps&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt; zmm0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;zmm1&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;zmm2&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt;       ; Normal operation&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-comment&quot;&gt;; Zeroing mask: k1 zeroes where mask bit = 0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;vaddps&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt; zmm0&lt;&#x2F;span&gt;&lt;span&gt; {&lt;&#x2F;span&gt;&lt;span class=&quot;z-storage&quot;&gt;k1&lt;&#x2F;span&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;span&gt;{&lt;&#x2F;span&gt;&lt;span class=&quot;z-storage&quot;&gt;z&lt;&#x2F;span&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;zmm3&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;zmm4&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-comment&quot;&gt;; Merging mask: lanes where the k2 bit = 0 keep their old value&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;vpaddd&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt; zmm5&lt;&#x2F;span&gt;&lt;span&gt; {&lt;&#x2F;span&gt;&lt;span class=&quot;z-storage&quot;&gt;k2&lt;&#x2F;span&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;zmm6&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;zmm7&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt;  ; zmm5[i] = mask[i] ? zmm6[i] + zmm7[i] : zmm5[i]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
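&lt;p&gt;For compare instructions that write a mask register (such as &lt;code&gt;VPCMPD&lt;&#x2F;code&gt;), the mask value at position $i$ is computed as:&lt;&#x2F;p&gt;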
&lt;p&gt;$$
\text{mask}[i] = \begin{cases}
1 &amp;amp; \text{if } \text{cond}(\text{src1}[i], \text{src2}[i]) \\
0 &amp;amp; \text{otherwise}
\end{cases}
$$&lt;&#x2F;p&gt;
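&lt;p&gt;The difference between merging and zeroing is easiest to see in scalar code. The sketch below models a masked 32-bit add over the sixteen elements of a ZMM register; &lt;code&gt;masked_add_u32&lt;&#x2F;code&gt; and &lt;code&gt;LANES&lt;&#x2F;code&gt; are illustrative names, not real intrinsics.&lt;&#x2F;p&gt;

```c
#include <assert.h>
#include <stdint.h>

#define LANES 16  /* a ZMM register holds sixteen 32-bit elements */

/* Scalar model of "vpaddd dst {k}, a, b" (zeroing == 0, merging) and
 * "vpaddd dst {k}{z}, a, b" (zeroing != 0). */
static void masked_add_u32(uint32_t dst[LANES], const uint32_t a[LANES],
                           const uint32_t b[LANES], uint16_t k, int zeroing)
{
    assert(dst != a && dst != b);   /* keep the model simple: no aliasing */
    for (int i = 0; i < LANES; ++i) {
        if ((k >> i) & 1)
            dst[i] = a[i] + b[i];   /* mask bit set: element is computed */
        else if (zeroing)
            dst[i] = 0;             /* {z}: masked-off element is cleared */
        /* merging: masked-off element keeps its previous dst value */
    }
}
```

&lt;p&gt;With &lt;code&gt;k = 0x0005&lt;&#x2F;code&gt;, only elements 0 and 2 are computed; every other element either keeps its old value (merging) or becomes zero (zeroing).&lt;&#x2F;p&gt;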
&lt;h3 id=&quot;a-8-lane-concepts-in-simd&quot;&gt;A.8 Lane Concepts in SIMD&lt;&#x2F;h3&gt;
&lt;p&gt;A “lane” is a sub-vector within a wider register:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;ZMM31 (512 bits) = 8 lanes of 64 bits each&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    |-----|-----|-----|-----|-----|-----|-----|-----|&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    |  0  |  1  |  2  |  3  |  4  |  5  |  6  |  7  |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;YMM15 (256 bits) = 4 lanes of 64 bits each&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    |-----------|-----------|-----------|-----------|&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    |     0     |     1     |     2     |     3     |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;XMM0  (128 bits) = 2 lanes of 64 bits each&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    |-----------------------|-----------------------|&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    |           0           |           1           |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
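&lt;p&gt;The diagram above divides each register into 64-bit elements. Intel’s
documentation usually reserves “lane” for a 128-bit slice of a YMM or ZMM
register: many AVX and AVX2 instructions (for example &lt;code&gt;VPSHUFB&lt;&#x2F;code&gt; and
&lt;code&gt;VPSHUFD&lt;&#x2F;code&gt;) operate within each 128-bit lane independently.
Lane-crossing operations require dedicated instructions such as
&lt;code&gt;VPERM2I128&lt;&#x2F;code&gt; or &lt;code&gt;VPERMD&lt;&#x2F;code&gt; and may incur
performance penalties.&lt;&#x2F;p&gt;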
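&lt;p&gt;A scalar sketch of the distinction, assuming a 256-bit register holding eight 32-bit elements in two 128-bit lanes. The function names are made up; the comments name the AVX2 instructions each one is meant to model.&lt;&#x2F;p&gt;

```c
#include <assert.h>
#include <stdint.h>

/* Models AVX2 "vpshufd ymm, ymm, imm8": the same four-element shuffle is
 * applied inside each 128-bit lane, so data never crosses a lane. */
static void shuffle_in_lane(uint32_t dst[8], const uint32_t src[8], uint8_t imm)
{
    assert(dst != src);
    for (int lane = 0; lane < 2; ++lane)
        for (int i = 0; i < 4; ++i) {
            int sel = (imm >> (2 * i)) & 3;          /* 2-bit selector per element */
            dst[lane * 4 + i] = src[lane * 4 + sel]; /* source stays in this lane  */
        }
}

/* Models AVX2 "vpermd": each destination element may come from any of
 * the eight source elements, so data can cross the lane boundary. */
static void permute_cross_lane(uint32_t dst[8], const uint32_t src[8],
                               const uint32_t idx[8])
{
    assert(dst != src);
    for (int i = 0; i < 8; ++i)
        dst[i] = src[idx[i] & 7];                    /* only the low 3 index bits count */
}
```

&lt;p&gt;Reversing all eight elements is a one-step job for the cross-lane version, while the in-lane shuffle with &lt;code&gt;imm = 0x1B&lt;&#x2F;code&gt; can only reverse each half separately.&lt;&#x2F;p&gt;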
&lt;h2 id=&quot;appendix-b-assembly-code-analysis&quot;&gt;Appendix B: Assembly Code Analysis&lt;&#x2F;h2&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;code1&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;9&lt;&#x2F;sup&gt;
&lt;p&gt;Appendix B&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;p&gt;&lt;strong&gt;Experimental Context:&lt;&#x2F;strong&gt; The assembly output examined in this appendix was generated by compiling the source C code with GCC 2.7.2.3 on a Debian Potato (2.2) system running under qemu-system-i386 virtualization (QEMU emulator version 9.2.4 (qemu-9.2.4-2.fc42)). The host system was an 11th Gen Intel(R) Core(TM) i7-11370H processor. This experimental setup recreates the 1997-era GCC compilation environment while running on modern hardware.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;b-1-source-code-and-compiler-context&quot;&gt;B.1 Source Code and Compiler Context&lt;&#x2F;h3&gt;
&lt;p&gt;The following analysis examines the GCC 2.7.2.3 assembly output for the integer vector addition function discussed in Section I. The source C code was:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;c&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-comment&quot;&gt;&#x2F;&#x2F;&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt; ANSI C prototype, declared ahead of the definition for the 1997-era compiler.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-storage z-type&quot;&gt;void&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name&quot;&gt; add_i32&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span class=&quot;z-storage z-type&quot;&gt;int&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword&quot;&gt;*&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable&quot;&gt; dest&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span class=&quot;z-storage&quot;&gt; const&lt;&#x2F;span&gt;&lt;span class=&quot;z-storage z-type&quot;&gt; int&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword&quot;&gt;*&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable&quot;&gt; a&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span class=&quot;z-storage&quot;&gt; const&lt;&#x2F;span&gt;&lt;span class=&quot;z-storage z-type&quot;&gt; int&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword&quot;&gt;*&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable&quot;&gt; b&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span class=&quot;z-storage z-type&quot;&gt; int&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable&quot;&gt; n&lt;&#x2F;span&gt;&lt;span&gt;)&lt;&#x2F;span&gt;&lt;span&gt;;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-storage z-type&quot;&gt;void&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name&quot;&gt; add_i32&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span class=&quot;z-storage z-type&quot;&gt;int&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword&quot;&gt;*&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable&quot;&gt; dest&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span class=&quot;z-storage&quot;&gt; const&lt;&#x2F;span&gt;&lt;span class=&quot;z-storage z-type&quot;&gt; int&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword&quot;&gt;*&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable&quot;&gt; a&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span class=&quot;z-storage&quot;&gt; const&lt;&#x2F;span&gt;&lt;span class=&quot;z-storage z-type&quot;&gt; int&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword&quot;&gt;*&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable&quot;&gt; b&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span class=&quot;z-storage z-type&quot;&gt; int&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable&quot;&gt; n&lt;&#x2F;span&gt;&lt;span&gt;)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;{&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-storage z-type&quot;&gt;    int&lt;&#x2F;span&gt;&lt;span&gt; i&lt;&#x2F;span&gt;&lt;span&gt;;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;    for&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span&gt;i &lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword&quot;&gt;=&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt; 0&lt;&#x2F;span&gt;&lt;span&gt;;&lt;&#x2F;span&gt;&lt;span&gt; i &lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span&gt; n&lt;&#x2F;span&gt;&lt;span&gt;;&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword&quot;&gt; ++&lt;&#x2F;span&gt;&lt;span&gt;i&lt;&#x2F;span&gt;&lt;span&gt;)&lt;&#x2F;span&gt;&lt;span&gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-variable&quot;&gt;        dest&lt;&#x2F;span&gt;&lt;span&gt;[&lt;&#x2F;span&gt;&lt;span&gt;i&lt;&#x2F;span&gt;&lt;span&gt;]&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword&quot;&gt; =&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable&quot;&gt; a&lt;&#x2F;span&gt;&lt;span&gt;[&lt;&#x2F;span&gt;&lt;span&gt;i&lt;&#x2F;span&gt;&lt;span&gt;]&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword&quot;&gt; +&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable&quot;&gt; b&lt;&#x2F;span&gt;&lt;span&gt;[&lt;&#x2F;span&gt;&lt;span&gt;i&lt;&#x2F;span&gt;&lt;span&gt;]&lt;&#x2F;span&gt;&lt;span&gt;;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Compiled with: &lt;code&gt;gcc -O2 -S test.c&lt;&#x2F;code&gt; (GCC 2.7.2.3, 1997-era)&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Compiler Version Context:&lt;&#x2F;strong&gt; GCC 2.7.2.3 was released August 20, 1997, during the Pentium MMX era. This version predates any MMX intrinsic support in GCC. MMX intrinsics first appeared in GCC 3.1 (2002), and auto-vectorization was not added until GCC 4.0 (2005) &lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#49&quot;&gt;44&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;b-2-generated-assembly-analysis&quot;&gt;B.2 Generated Assembly Analysis&lt;&#x2F;h3&gt;
&lt;p&gt;The complete assembly output (&lt;code&gt;assets&#x2F;out.s&lt;&#x2F;code&gt;) consists of 40 lines. The following analysis provides a line-by-line examination with verification against contemporary documentation.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Header Section (Lines 1-7):&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Line&lt;&#x2F;th&gt;&lt;th&gt;Assembly&lt;&#x2F;th&gt;&lt;th&gt;Analysis&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;.file &quot;test.c&quot;&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;Debug info directive, specifies source filename&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;2&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;.version &quot;01.01&quot;&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;GAS assembler version string&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;3&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;gcc2_compiled.:&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;&lt;strong&gt;VERIFIED:&lt;&#x2F;strong&gt; Valid GNU assembly identifier. The trailing dot is part of the symbol name, used by libg++ to identify GCC-compiled objects &lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#52&quot;&gt;45&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;4&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;.text&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;Code section directive&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
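&lt;tr&gt;&lt;td&gt;5&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;.align 4&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;4-byte alignment. On i386 ELF targets GAS reads the &lt;code&gt;.align&lt;&#x2F;code&gt; operand as a byte count, not a power of two, so this is not the 16-byte alignment preferred by Pentium Pro&#x2F;Pentium II instruction fetch&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;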
&lt;tr&gt;&lt;td&gt;6&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;.globl add_i32&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;Exports symbol globally&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;7&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;.type add_i32,@function&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;ELF symbol type directive; marks &lt;code&gt;add_i32&lt;&#x2F;code&gt; as a function in the symbol table&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;&lt;strong&gt;Prologue and Parameter Loading (Lines 8-17):&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;The function follows the &lt;strong&gt;System V i386 ABI&lt;&#x2F;strong&gt; with cdecl calling convention &lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#53&quot;&gt;46&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;asm&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-entity z-name&quot;&gt;add_i32&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name&quot;&gt;:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    pushl %&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;ebp&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt;           ; Save caller&amp;#39;s frame pointer&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    movl %&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;esp&lt;&#x2F;span&gt;&lt;span&gt;,%&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;ebp&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt;       ; Establish new frame pointer&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    pushl %&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;edi&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt;           ; Save callee-saved registers&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    pushl %&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;esi&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    pushl %&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;ebx&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    movl &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;8&lt;&#x2F;span&gt;&lt;span&gt;(%&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;ebp&lt;&#x2F;span&gt;&lt;span&gt;),%&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;esi&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt;    ; %esi = dest (arg 1)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    movl &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;12&lt;&#x2F;span&gt;&lt;span&gt;(%&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;ebp&lt;&#x2F;span&gt;&lt;span&gt;),%&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;ebx&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt;   ; %ebx = a (arg 2)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    movl &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;16&lt;&#x2F;span&gt;&lt;span&gt;(%&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;ebp&lt;&#x2F;span&gt;&lt;span&gt;),%&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;ecx&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt;   ; %ecx = b (arg 3)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    movl &lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;20&lt;&#x2F;span&gt;&lt;span&gt;(%&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;ebp&lt;&#x2F;span&gt;&lt;span&gt;),%&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;edx&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt;   ; %edx = n (arg 4)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;&lt;strong&gt;Stack Offset Verification:&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;After &lt;code&gt;pushl %ebp&lt;&#x2F;code&gt; and &lt;code&gt;movl %esp,%ebp&lt;&#x2F;code&gt;: 0(%ebp) = saved %ebp, 4(%ebp) = return address&lt;&#x2F;li&gt;
&lt;li&gt;8(%ebp) = first argument (dest)&lt;&#x2F;li&gt;
&lt;li&gt;12(%ebp) = second argument (a)&lt;&#x2F;li&gt;
&lt;li&gt;16(%ebp) = third argument (b)&lt;&#x2F;li&gt;
&lt;li&gt;20(%ebp) = fourth argument (n)&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
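&lt;p&gt;&lt;strong&gt;Register Allocation:&lt;&#x2F;strong&gt; The compiler saves %edi, %esi, and %ebx because the ABI treats them as callee-saved and the function overwrites all three: %esi and %ebx hold the dest and a pointers, and %edi is the accumulator. The remaining values (b, n, and the loop index) live in the caller-saved registers %ecx, %edx, and %eax, which need no saving.&lt;&#x2F;p&gt;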
&lt;p&gt;&lt;strong&gt;Main Loop (Lines 22-28):&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;asm&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-storage&quot;&gt;.&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name&quot;&gt;L5&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name&quot;&gt;:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    movl (%&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;ebx&lt;&#x2F;span&gt;&lt;span&gt;,%&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;eax&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;4&lt;&#x2F;span&gt;&lt;span&gt;),%&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;edi&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt;    ; Load a[i]: edi = *(ebx + eax*4)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    addl (%&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;ecx&lt;&#x2F;span&gt;&lt;span&gt;,%&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;eax&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;4&lt;&#x2F;span&gt;&lt;span&gt;),%&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;edi&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt;    ; Add b[i]: edi += *(ecx + eax*4)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    movl %&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;edi&lt;&#x2F;span&gt;&lt;span&gt;, (%&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;esi&lt;&#x2F;span&gt;&lt;span&gt;,%&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;eax&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;4&lt;&#x2F;span&gt;&lt;span&gt;)   &lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt;; Store result: *(esi + eax*4) = edi&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    incl %&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;eax&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt;                  ; Increment loop counter&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    cmpl %&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;edx&lt;&#x2F;span&gt;&lt;span&gt;,%&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;eax&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt;             ; Compare with bound&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;    jl&lt;&#x2F;span&gt;&lt;span&gt; .L5                     &lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt;; Loop if eax &amp;lt; edx&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;&lt;strong&gt;Addressing Mode Analysis:&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Scale factor of 4 correctly represents sizeof(int)&lt;&#x2F;li&gt;
&lt;li&gt;Base+index addressing is optimal for array access&lt;&#x2F;li&gt;
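&lt;li&gt;&lt;code&gt;addl&lt;&#x2F;code&gt; takes b[i] directly as a memory operand, so each iteration makes three memory accesses (two loads, one store) without a separate load instruction for b[i]&lt;&#x2F;li&gt;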
&lt;&#x2F;ul&gt;
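&lt;p&gt;For orientation, the loop above corresponds to C source along the following lines. This is a reconstruction from the register comments (dest, a, b, n), not the original file: the parameter types and the signed &lt;code&gt;int&lt;&#x2F;code&gt; index are assumptions.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;c&quot;&gt;&#x2F;* Hypothetical reconstruction: names taken from the assembly comments. *&#x2F;
void add_i32(int *dest, const int *a, const int *b, int n)
{
    int i;  &#x2F;* declared up front, C89 style *&#x2F;

    for (i = 0; i &amp;lt; n; i++) {
        &#x2F;* movl a[i] into a register, addl b[i], movl the sum to dest[i] *&#x2F;
        dest[i] = a[i] + b[i];
    }
}&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The signed &lt;code&gt;jl&lt;&#x2F;code&gt; comparison in the listing is what a signed &lt;code&gt;int&lt;&#x2F;code&gt; loop bound would produce; an unsigned index would typically compile to &lt;code&gt;jb&lt;&#x2F;code&gt; instead.&lt;&#x2F;p&gt;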
&lt;p&gt;&lt;strong&gt;Epilogue (Lines 29-35):&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;asm&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-storage&quot;&gt;.&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name&quot;&gt;L3&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name&quot;&gt;:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    leal -&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;12&lt;&#x2F;span&gt;&lt;span&gt;(%&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;ebp&lt;&#x2F;span&gt;&lt;span&gt;),%&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;esp&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt;    ; Stack pointer adjustment&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    popl %&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;ebx&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt;              ; Restore callee-saved registers&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    popl %&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;esi&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    popl %&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant&quot;&gt;edi&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;    leave&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt;                  ; movl %ebp,%esp; popl %ebp&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;    ret&lt;&#x2F;span&gt;&lt;span class=&quot;z-comment&quot;&gt;                    ; Return to caller&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h3 id=&quot;b-3-critical-finding-why-no-mmx-instructions&quot;&gt;B.3 Critical Finding: Why No MMX Instructions?&lt;&#x2F;h3&gt;
&lt;p&gt;&lt;strong&gt;The claim in Section I—that GCC 2.7.2.3 “failed” to generate MMX code—requires clarification.&lt;&#x2F;strong&gt; What does “failed” mean in this context?&lt;&#x2F;p&gt;
&lt;p&gt;GCC 2.7.x had &lt;strong&gt;no capability to generate MMX instructions whatsoever&lt;&#x2F;strong&gt; &lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#54&quot;&gt;47&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;. This was not an implementation failure but a fundamental design decision by Intel and the GCC project. Intel released MMX technology in January 1997 with aggressive marketing claims about performance improvements, yet they did not collaborate with the GCC team to ensure compiler support.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;GCC MMX Support Timeline:&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Version&lt;&#x2F;th&gt;&lt;th&gt;Release&lt;&#x2F;th&gt;&lt;th&gt;MMX Support&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;GCC 2.7.x&lt;&#x2F;td&gt;&lt;td&gt;1995-1997&lt;&#x2F;td&gt;&lt;td&gt;&lt;strong&gt;None&lt;&#x2F;strong&gt;&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;GCC 3.0&lt;&#x2F;td&gt;&lt;td&gt;2001&lt;&#x2F;td&gt;&lt;td&gt;Broken &lt;code&gt;-mmmx&lt;&#x2F;code&gt; flag&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;GCC 3.1&lt;&#x2F;td&gt;&lt;td&gt;2002&lt;&#x2F;td&gt;&lt;td&gt;Initial intrinsics&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;GCC 4.0&lt;&#x2F;td&gt;&lt;td&gt;2005&lt;&#x2F;td&gt;&lt;td&gt;Auto-vectorization&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;(GCC 3.0: &lt;code&gt;-mmmx&lt;&#x2F;code&gt; partial backend support; intrinsics incomplete and unstable)&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;What “Failed” Really Means:&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;The term “failed” implies Intel expected automatic MMX code generation from existing compilers. However, Intel did not work with the GNU Compiler Collection project to add MMX support. GCC was the primary compiler for Linux, BSD, and many embedded systems in 1997. If Intel wanted their marketing claims about MMX performance to reach everyday developers, they should have:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;Provided MMX intrinsic headers and documentation to the GCC team in 1996-1997&lt;&#x2F;li&gt;
&lt;li&gt;Collaborated on machine description updates for MMX instruction selection&lt;&#x2F;li&gt;
&lt;li&gt;Ensured GCC could generate MMX code alongside proprietary compilers like Intel’s ICC&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;Without this collaboration, the “marketing reality gap” widened. Intel claimed 50-700% improvements for MMX-optimized software, but developers using GCC could not achieve these speedups without writing hand-optimized assembly. The comparison in Section I between GCC output and MMX code is therefore a comparison between what Intel’s hardware could do and what Intel’s failure to work with the dominant open-source compiler allowed developers to achieve.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Evidence from GCC development history:&lt;&#x2F;strong&gt;
Richard Henderson stated in December 2004: “As mentioned in another thread, we can’t generate proper MMX&#x2F;3DNOW code ourselves. The existing intrinsics expect users to write them manually” &lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#54&quot;&gt;47&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;The comparison in Section I is therefore &lt;strong&gt;comparing compiler output to what would require hand-written assembly&lt;&#x2F;strong&gt;. This is not a “failure” of GCC 2.7.2.3 in the sense of a bug or regression. It is a &lt;strong&gt;fundamental limitation of 1996-era compiler technology&lt;&#x2F;strong&gt; that Intel could have addressed but chose not to.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;b-4-performance-gap-analysis&quot;&gt;B.4 Performance Gap Analysis&lt;&#x2F;h3&gt;
&lt;p&gt;Using instruction latency data from Agner Fog’s optimization manuals and Intel documentation, we can quantify the performance difference between the generated scalar code and an optimal MMX implementation &lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#52&quot;&gt;45&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#55&quot;&gt;48&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Scalar Implementation (GCC output):&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Instructions per element: 6 (the full &lt;code&gt;.L5&lt;&#x2F;code&gt; loop body: load, add, store, increment, compare, branch)&lt;&#x2F;li&gt;
&lt;li&gt;CPI (estimated): 1.1&lt;&#x2F;li&gt;
&lt;li&gt;Cycles per element: ~6.6&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;&lt;strong&gt;Optimal MMX Implementation:&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Instructions per 2 elements: 5 (MOVQ load, PADDD, MOVQ store, index update, branch; this assumes the index counts up to zero so no separate compare is needed)&lt;&#x2F;li&gt;
&lt;li&gt;Instructions per element: 2.5&lt;&#x2F;li&gt;
&lt;li&gt;CPI (estimated): 1.0&lt;&#x2F;li&gt;
&lt;li&gt;Cycles per element: ~2.5&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;&lt;strong&gt;Performance Comparison (Pentium MMX, 233 MHz):&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Metric&lt;&#x2F;th&gt;&lt;th&gt;Scalar&lt;&#x2F;th&gt;&lt;th&gt;MMX&lt;&#x2F;th&gt;&lt;th&gt;Improvement&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Cycles&#x2F;element&lt;&#x2F;td&gt;&lt;td&gt;6.6&lt;&#x2F;td&gt;&lt;td&gt;2.5&lt;&#x2F;td&gt;&lt;td&gt;&lt;strong&gt;2.6x&lt;&#x2F;strong&gt;&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Elements&#x2F;sec (10M array)&lt;&#x2F;td&gt;&lt;td&gt;35.3M&lt;&#x2F;td&gt;&lt;td&gt;93.2M&lt;&#x2F;td&gt;&lt;td&gt;&lt;strong&gt;2.6x&lt;&#x2F;strong&gt;&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Memory ops per 8 elements&lt;&#x2F;td&gt;&lt;td&gt;24&lt;&#x2F;td&gt;&lt;td&gt;12&lt;&#x2F;td&gt;&lt;td&gt;&lt;strong&gt;2.0x&lt;&#x2F;strong&gt;&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Branch ops per 8 elements&lt;&#x2F;td&gt;&lt;td&gt;8&lt;&#x2F;td&gt;&lt;td&gt;4&lt;&#x2F;td&gt;&lt;td&gt;&lt;strong&gt;2.0x&lt;&#x2F;strong&gt;&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
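&lt;p&gt;The estimates above reduce to two small formulas: cycles per element is instructions per element times CPI, and throughput is clock frequency divided by cycles per element. A minimal sketch of that arithmetic follows; the function names are mine and purely illustrative, and the model ignores cache misses and pairing effects just as the table does.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;c&quot;&gt;&#x2F;* Cycles per array element: instructions per element times average CPI. *&#x2F;
double cycles_per_element(double instr_per_element, double cpi)
{
    return instr_per_element * cpi;
}

&#x2F;* Elements per second at clock_hz, given the cycles spent on each element. *&#x2F;
double elements_per_second(double clock_hz, double cycles_per_elem)
{
    return clock_hz &#x2F; cycles_per_elem;
}

&#x2F;* Speedup of a faster variant over a baseline, both in cycles per element. *&#x2F;
double speedup(double baseline_cycles, double faster_cycles)
{
    return baseline_cycles &#x2F; faster_cycles;
}&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;With the MMX estimates (2.5 instructions per element, CPI 1.0) on a 233 MHz part, &lt;code&gt;cycles_per_element&lt;&#x2F;code&gt; gives 2.5 and &lt;code&gt;elements_per_second&lt;&#x2F;code&gt; gives 93.2M, matching the MMX column of the table.&lt;&#x2F;p&gt;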
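&lt;p&gt;&lt;strong&gt;EMMS Overhead:&lt;&#x2F;strong&gt; The EMMS instruction (required after MMX code, before any x87 floating-point use) costs 2-4 cycles. It runs once after the loop rather than once per iteration, so for N elements it adds at most 4&#x2F;N cycles per element. For N=1000 that is 0.004 cycles per element, under 0.2% of the ~2.5 cycles per element estimated above.&lt;&#x2F;p&gt;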
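&lt;p&gt;To show where EMMS sits, here is a rough sketch of the same loop written with the MMX intrinsics from &lt;code&gt;mmintrin.h&lt;&#x2F;code&gt;, which GCC only gained years later. The name &lt;code&gt;add_i32_mmx&lt;&#x2F;code&gt; and the scalar fallback are assumptions made for this sketch and do not come from the benchmark code. It handles two 32-bit elements per iteration and calls &lt;code&gt;_mm_empty()&lt;&#x2F;code&gt; (EMMS) once, after the loop.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;c&quot;&gt;#include &amp;lt;string.h&amp;gt;
#if defined(__MMX__)
#include &amp;lt;mmintrin.h&amp;gt;
#endif

&#x2F;* Hypothetical helper: dest[i] = a[i] + b[i] for i in [0, n). *&#x2F;
void add_i32_mmx(int *dest, const int *a, const int *b, int n)
{
    int i = 0;
#if defined(__MMX__)
    for (; i + 2 &amp;lt;= n; i += 2) {
        __m64 va, vb, sum;
        memcpy(&amp;amp;va, a + i, sizeof va);      &#x2F;* load a[i], a[i+1] *&#x2F;
        memcpy(&amp;amp;vb, b + i, sizeof vb);      &#x2F;* load b[i], b[i+1] *&#x2F;
        sum = _mm_add_pi32(va, vb);         &#x2F;* PADDD: two 32-bit adds at once *&#x2F;
        memcpy(dest + i, &amp;amp;sum, sizeof sum); &#x2F;* store both results *&#x2F;
    }
    _mm_empty();                            &#x2F;* EMMS: once, after the loop *&#x2F;
#endif
    for (; i &amp;lt; n; i++) {                    &#x2F;* odd tail, or targets without MMX *&#x2F;
        dest[i] = a[i] + b[i];
    }
}&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Because &lt;code&gt;_mm_empty()&lt;&#x2F;code&gt; sits outside the loop, its cost is paid once per call no matter how large n is, which is why the per-element overhead shrinks as 1&#x2F;N.&lt;&#x2F;p&gt;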
&lt;h3 id=&quot;b-5-the-productivity-gap&quot;&gt;B.5 The Productivity Gap&lt;&#x2F;h3&gt;
&lt;p&gt;The 2-4x performance gap between hardware capability and compiler output in 1997 represents what we call the &lt;strong&gt;productivity gap&lt;&#x2F;strong&gt;, the difference between what SIMD hardware could do and what compilers could exploit &lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#56&quot;&gt;49&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Industry Response:&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;Intel released the &lt;em&gt;MMX Technology Programmer’s Reference Manual&lt;&#x2F;em&gt; (245794, 1997), encouraging manual use of intrinsics&lt;&#x2F;li&gt;
&lt;li&gt;Developers wrote assembly code directly&lt;&#x2F;li&gt;
&lt;li&gt;GCC eventually added intrinsics (GCC 3.1, 2002) and auto-vectorization (GCC 4.0, 2005)&lt;&#x2F;li&gt;
&lt;li&gt;The fundamental challenge persists: modern compilers still miss 30-50% of vectorization opportunities&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;h1 id=&quot;acknowledgements&quot;&gt;Acknowledgements&lt;&#x2F;h1&gt;
&lt;h2 id=&quot;18-feb-2026&quot;&gt;18 Feb 2026&lt;&#x2F;h2&gt;
&lt;p&gt;I would like to thank &lt;em&gt;bal-e&lt;&#x2F;em&gt; and &lt;em&gt;hailey&lt;&#x2F;em&gt; from the lobste.rs forum
for noticing the hallucinations left behind by the AI tools I used to proofread this article.
I am sincerely sorry for ever letting this happen.
I shouldn’t have used them in the first place, and I promise I will
never again use any AI tools to proofread or assist with the articles on
this blog.&lt;&#x2F;p&gt;
&lt;p&gt;I would also like to thank &lt;em&gt;hoistbypetard&lt;&#x2F;em&gt; for inviting me to lobste.rs.&lt;&#x2F;p&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;†&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;18&lt;&#x2F;sup&gt;&lt;&#x2F;div&gt;
&lt;p&gt;I would also like to thank Peter Kankowski for pointing out the flaw in one of the SSE3 examples.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;25-feb-2026&quot;&gt;25 Feb 2026&lt;&#x2F;h2&gt;
&lt;p&gt;In &lt;em&gt;SSE2 (2000): The Pentium 4’s “Sledgehammer”&lt;&#x2F;em&gt; (now “SSE2 (2000): The Pentium 4’s and Sledgehammers”),
I mistakenly wrote that &lt;em&gt;Intel internally called Willamette “Sledgehammer”&lt;&#x2F;em&gt;.
That is not correct at all. I’ve now rewritten that part of the SSE2 section to dive deeper into
what happened and to include more context around that era. I have no idea how a mistake like that could happen.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h1 id=&quot;references&quot;&gt;References&lt;&#x2F;h1&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;1&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;1&lt;&#x2F;sup&gt;
&lt;p&gt;Yu, Albert. “The Story of Intel MMX Technology.” &lt;em&gt;Intel
Technology Journal&lt;&#x2F;em&gt;, Q3 1997, pp. 4-13. &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.intel.com&#x2F;content&#x2F;dam&#x2F;www&#x2F;public&#x2F;us&#x2F;en&#x2F;documents&#x2F;research&#x2F;1997-vol01-iss-3-intel-technology-journal.pdf&quot;&gt;https:&#x2F;&#x2F;www.intel.com&#x2F;content&#x2F;dam&#x2F;www&#x2F;public&#x2F;us&#x2F;en&#x2F;documents&#x2F;research&#x2F;1997-vol01-iss-3-intel-technology-journal.pdf&lt;&#x2F;a&gt;. Accessed January 15, 2026.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;2&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;2&lt;&#x2F;sup&gt;
&lt;p&gt;Peleg, Alexander D., and Uri Weiser. “MMX Technology Extension
to the Intel Architecture.” &lt;em&gt;IEEE Micro&lt;&#x2F;em&gt;, Vol. 16, Iss. 4, pp. 42-50,
August 1996. &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;safari.ethz.ch&#x2F;architecture&#x2F;fall2020&#x2F;lib&#x2F;exe&#x2F;fetch.php?media=mmx_technology_1996.pdf&quot;&gt;https:&#x2F;&#x2F;safari.ethz.ch&#x2F;architecture&#x2F;fall2020&#x2F;lib&#x2F;exe&#x2F;fetch.php?media=mmx_technology_1996.pdf&lt;&#x2F;a&gt;. Accessed January 15, 2026.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;5&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;3&lt;&#x2F;sup&gt;
&lt;p&gt;Intel Corporation. &lt;em&gt;MMX Technology Architecture Overview&lt;&#x2F;em&gt;.
Order Number 243081-002, March 1996. &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.ardent-tool.com&#x2F;CPU&#x2F;docs&#x2F;Intel&#x2F;MMX&#x2F;243081-002.pdf&quot;&gt;https:&#x2F;&#x2F;www.ardent-tool.com&#x2F;CPU&#x2F;docs&#x2F;Intel&#x2F;MMX&#x2F;243081-002.pdf&lt;&#x2F;a&gt;. Accessed January 15, 2026.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;6&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;4&lt;&#x2F;sup&gt;
&lt;p&gt;Fog, Agner. “Optimizing subroutines in assembly language.”
&lt;em&gt;Agner Fog’s Optimization Manuals&lt;&#x2F;em&gt;, 2024. &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.agner.org&#x2F;optimize&#x2F;&quot;&gt;https:&#x2F;&#x2F;www.agner.org&#x2F;optimize&#x2F;&lt;&#x2F;a&gt;.
Accessed January 15, 2026.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;7&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;5&lt;&#x2F;sup&gt;
&lt;p&gt;Intel Corporation. “MMX Trademark Registration.”
&lt;em&gt;USPTO&lt;&#x2F;em&gt;, 1997. &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;tsdr.uspto.gov&#x2F;&quot;&gt;https:&#x2F;&#x2F;tsdr.uspto.gov&#x2F;&lt;&#x2F;a&gt;. Accessed January 15, 2026.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;8&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;6&lt;&#x2F;sup&gt;
&lt;p&gt;“Intel sues Cyrix, AMD over MMX name.” &lt;em&gt;CNET&lt;&#x2F;em&gt;, March 17, 1997.
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.cnet.com&#x2F;tech&#x2F;services-and-software&#x2F;intel-sues-over-mmx-trademark&#x2F;&quot;&gt;https:&#x2F;&#x2F;www.cnet.com&#x2F;tech&#x2F;services-and-software&#x2F;intel-sues-over-mmx-trademark&#x2F;&lt;&#x2F;a&gt;.
Accessed February 18, 2026.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;9&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;7&lt;&#x2F;sup&gt;
&lt;p&gt;Intel Corporation. “Intel Launches New Ad Campaign For MMX™ Technology That Puts The Fun In Computing.” &lt;em&gt;Intel Press Release&lt;&#x2F;em&gt;, January 1997. Archived at: &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;web.archive.org&#x2F;web&#x2F;20260218080609&#x2F;https:&#x2F;&#x2F;www.intel.com&#x2F;pressroom&#x2F;archive&#x2F;releases&#x2F;1997&#x2F;CN12297A.HTM&quot;&gt;https:&#x2F;&#x2F;web.archive.org&#x2F;web&#x2F;20260218080609&#x2F;https:&#x2F;&#x2F;www.intel.com&#x2F;pressroom&#x2F;archive&#x2F;releases&#x2F;1997&#x2F;CN12297A.HTM&lt;&#x2F;a&gt;. Accessed January 15, 2026.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;10&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;8&lt;&#x2F;sup&gt;
&lt;p&gt;Stokes, Jon. “3 1&#x2F;2 SIMD Architectures.” &lt;em&gt;Ars Technica&lt;&#x2F;em&gt;, March 1, 2000.
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;arstechnica.com&#x2F;features&#x2F;2000&#x2F;03&#x2F;simd&#x2F;&quot;&gt;https:&#x2F;&#x2F;arstechnica.com&#x2F;features&#x2F;2000&#x2F;03&#x2F;simd&#x2F;&lt;&#x2F;a&gt;. Accessed
January 15, 2026.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;11&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;10&lt;&#x2F;sup&gt;
&lt;p&gt;Intel Corporation. “Intel Launches Pentium III Processor.”
&lt;em&gt;Intel Press Release&lt;&#x2F;em&gt;, February 26, 1999. &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.intel.com&#x2F;pressroom&#x2F;archive&#x2F;releases&#x2F;1999&#x2F;dp022699.htm&quot;&gt;https:&#x2F;&#x2F;www.intel.com&#x2F;pressroom&#x2F;archive&#x2F;releases&#x2F;1999&#x2F;dp022699.htm&lt;&#x2F;a&gt;. Accessed January 15, 2026.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;12&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;11&lt;&#x2F;sup&gt;
&lt;p&gt;AMD Corporation. “3DNow! Technology Manual.” Publication
21928, 2000. &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.amd.com&#x2F;content&#x2F;dam&#x2F;amd&#x2F;en&#x2F;documents&#x2F;archived-tech-docs&#x2F;programmer-references&#x2F;21928.pdf&quot;&gt;https:&#x2F;&#x2F;www.amd.com&#x2F;content&#x2F;dam&#x2F;amd&#x2F;en&#x2F;documents&#x2F;archived-tech-docs&#x2F;programmer-references&#x2F;21928.pdf&lt;&#x2F;a&gt;. Accessed
January 15, 2026.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;13&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;12&lt;&#x2F;sup&gt;
&lt;p&gt;Diefendorff, Keith. “Pentium III = Pentium II + SSE.”
&lt;em&gt;Microprocessor Report&lt;&#x2F;em&gt;, March 8, 1999. &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.cs.cmu.edu&#x2F;~afs&#x2F;cs&#x2F;academic&#x2F;class&#x2F;15740-f02&#x2F;public&#x2F;doc&#x2F;discussions&#x2F;uniprocessors&#x2F;media&#x2F;mpr_p3_mar99.pdf&quot;&gt;https:&#x2F;&#x2F;www.cs.cmu.edu&#x2F;~afs&#x2F;cs&#x2F;academic&#x2F;class&#x2F;15740-f02&#x2F;public&#x2F;doc&#x2F;discussions&#x2F;uniprocessors&#x2F;media&#x2F;mpr_p3_mar99.pdf&lt;&#x2F;a&gt;. Accessed January 15, 2026.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;14&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;13&lt;&#x2F;sup&gt;
&lt;p&gt;Stokes, Jon. “Sequel: MMX2&#x2F;SSE&#x2F;KNI.” &lt;em&gt;Ars Technica&lt;&#x2F;em&gt;, March 22, 2000.
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;arstechnica.com&#x2F;features&#x2F;2000&#x2F;03&#x2F;simd&#x2F;5&#x2F;&quot;&gt;https:&#x2F;&#x2F;arstechnica.com&#x2F;features&#x2F;2000&#x2F;03&#x2F;simd&#x2F;5&#x2F;&lt;&#x2F;a&gt;. Accessed
January 15, 2026.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;15&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;15&lt;&#x2F;sup&gt;
&lt;p&gt;Mueller, Scott. “UPGRADING AND REPAIRING PCs, 19th Edition” &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;ptgmedia.pearsoncmg.com&#x2F;imprint_downloads&#x2F;que&#x2F;bookreg&#x2F;9780132776875&#x2F;URPCs_19thEdition.pdf&quot;&gt;https:&#x2F;&#x2F;ptgmedia.pearsoncmg.com&#x2F;imprint_downloads&#x2F;que&#x2F;bookreg&#x2F;9780132776875&#x2F;URPCs_19thEdition.pdf&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;16&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;14&lt;&#x2F;sup&gt;
&lt;p&gt;Microsoft Corporation. &lt;em&gt;DirectX 7 SDK Documentation&lt;&#x2F;em&gt;.
Redmond, WA: Microsoft, 1999. &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;learn.microsoft.com&#x2F;en-us&#x2F;windows&#x2F;win32&#x2F;directx&quot;&gt;https:&#x2F;&#x2F;learn.microsoft.com&#x2F;en-us&#x2F;windows&#x2F;win32&#x2F;directx&lt;&#x2F;a&gt;. Accessed January 15, 2026.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;17&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;16&lt;&#x2F;sup&gt;
&lt;p&gt;Stokes, Jon. “The future of Prescott: when Moore gives you
lemons…” &lt;em&gt;Ars Technica&lt;&#x2F;em&gt;, June 21, 2004. &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;arstechnica.com&#x2F;features&#x2F;2004&#x2F;06&#x2F;prescott&#x2F;2&#x2F;&quot;&gt;https:&#x2F;&#x2F;arstechnica.com&#x2F;features&#x2F;2004&#x2F;06&#x2F;prescott&#x2F;2&#x2F;&lt;&#x2F;a&gt;. Accessed January 15, 2026.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;18&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;19&lt;&#x2F;sup&gt;
&lt;p&gt;Markoff, John. “Technology; Intel’s Big Shift After Hitting
Technical Wall.” &lt;em&gt;The New York Times&lt;&#x2F;em&gt;, May 17, 2004. &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.nytimes.com&#x2F;2004&#x2F;05&#x2F;17&#x2F;business&#x2F;technology-intel-s-big-shift-after-hitting-technical-wall.html&quot;&gt;https:&#x2F;&#x2F;www.nytimes.com&#x2F;2004&#x2F;05&#x2F;17&#x2F;business&#x2F;technology-intel-s-big-shift-after-hitting-technical-wall.html&lt;&#x2F;a&gt;. Accessed January 15, 2026.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;19&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;17&lt;&#x2F;sup&gt;
&lt;p&gt;Gavrichenkov, Ilya. “Intel Prescott: One More Willamette-like Slow processor or a Worthy Piece (page 10)”. &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;web.archive.org&#x2F;web&#x2F;20071017164126&#x2F;http:&#x2F;&#x2F;xbitlabs.com&#x2F;articles&#x2F;cpu&#x2F;display&#x2F;prescott_10.html#sect0&quot;&gt;https:&#x2F;&#x2F;web.archive.org&#x2F;web&#x2F;20071017164126&#x2F;http:&#x2F;&#x2F;xbitlabs.com&#x2F;articles&#x2F;cpu&#x2F;display&#x2F;prescott_10.html#sect0&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;20&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;20&lt;&#x2F;sup&gt;
&lt;p&gt;Intel Corporation. &lt;em&gt;Intel 64 and IA-32 Architectures Software Developer’s Manual&lt;&#x2F;em&gt;, Volume 1: Basic Architecture. Order Number 253665, 2016. &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.intel.com&#x2F;content&#x2F;dam&#x2F;www&#x2F;public&#x2F;us&#x2F;en&#x2F;documents&#x2F;manuals&#x2F;64-ia-32-architectures-software-developer-vol-2b-manual.pdf&quot;&gt;https:&#x2F;&#x2F;www.intel.com&#x2F;content&#x2F;dam&#x2F;www&#x2F;public&#x2F;us&#x2F;en&#x2F;documents&#x2F;manuals&#x2F;64-ia-32-architectures-software-developer-vol-2b-manual.pdf&lt;&#x2F;a&gt;. Accessed January 15, 2026.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;22&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;21&lt;&#x2F;sup&gt;
&lt;p&gt;Intel Corporation. &lt;em&gt;Intel SSE4 Programming Reference&lt;&#x2F;em&gt;.
Order Number D91561-003, July 2007. &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.intel.com&#x2F;content&#x2F;dam&#x2F;develop&#x2F;external&#x2F;us&#x2F;en&#x2F;documents&#x2F;d9156103-705230.pdf&quot;&gt;https:&#x2F;&#x2F;www.intel.com&#x2F;content&#x2F;dam&#x2F;develop&#x2F;external&#x2F;us&#x2F;en&#x2F;documents&#x2F;d9156103-705230.pdf&lt;&#x2F;a&gt;. Accessed January 15, 2026.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;23&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;22&lt;&#x2F;sup&gt;
&lt;p&gt;Intel Corporation. “Intel Extends Performance Leadership with New Pentium 4 Processors.”
&lt;em&gt;Intel Press Release&lt;&#x2F;em&gt;, May 6, 2002. &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.intel.com&#x2F;pressroom&#x2F;archive&#x2F;releases&#x2F;2002&#x2F;20020506comp.htm&quot;&gt;https:&#x2F;&#x2F;www.intel.com&#x2F;pressroom&#x2F;archive&#x2F;releases&#x2F;2002&#x2F;20020506comp.htm&lt;&#x2F;a&gt;. Accessed January 15, 2026.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;25&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;23&lt;&#x2F;sup&gt;
&lt;p&gt;Kamikura, Masaru. “Intel 45nm Processor Demo.” &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=TGCt4NyJWTY&quot;&gt;https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=TGCt4NyJWTY&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;26&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;24&lt;&#x2F;sup&gt;
&lt;p&gt;Intel Corporation. “Intel Acquires Sarvega To Bolster Software,
Enterprise Platform Strategies.” &lt;em&gt;Intel Press Release&lt;&#x2F;em&gt;, August 17, 2005.
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.intel.com&#x2F;pressroom&#x2F;archive&#x2F;releases&#x2F;2005&#x2F;20050817corp.htm&quot;&gt;https:&#x2F;&#x2F;www.intel.com&#x2F;pressroom&#x2F;archive&#x2F;releases&#x2F;2005&#x2F;20050817corp.htm&lt;&#x2F;a&gt;.
Accessed January 15, 2026.&lt;&#x2F;p&gt;
&lt;p&gt;See also: “Intel Buys Into XML Processing With Conformative.”
&lt;em&gt;EE Times&lt;&#x2F;em&gt;, February 8, 2006. &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.eetimes.com&#x2F;intel-buys-into-xml-processing-with-conformative&#x2F;&quot;&gt;https:&#x2F;&#x2F;www.eetimes.com&#x2F;intel-buys-into-xml-processing-with-conformative&#x2F;&lt;&#x2F;a&gt;. Accessed January 15, 2026.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;27&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;25&lt;&#x2F;sup&gt;
&lt;p&gt;Zhang, Austin (austin_zhang@linux.intel.com). “FWD:[PATCH]Using Intel CRC32 instruction to implement hardware accelerated CRC32c algorithm”. &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;lore.kernel.org&#x2F;lkml&#x2F;1215422258.19059.26.camel@localhost.localdomain&#x2F;&quot;&gt;https:&#x2F;&#x2F;lore.kernel.org&#x2F;lkml&#x2F;1215422258.19059.26.camel@localhost.localdomain&#x2F;&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;28&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;50&lt;&#x2F;sup&gt;
&lt;p&gt;Intel Corporation. &lt;em&gt;Intel 64 and IA-32 Architectures Software Developer’s Manual&lt;&#x2F;em&gt;, Volume 2B: Instruction Set Reference M-U. Order Number 253667, 2016. &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;cdrdv2.intel.com&#x2F;v1&#x2F;dl&#x2F;getContent&#x2F;671200&quot;&gt;https:&#x2F;&#x2F;cdrdv2.intel.com&#x2F;v1&#x2F;dl&#x2F;getContent&#x2F;671200&lt;&#x2F;a&gt;. Accessed January 15, 2026.
Note: if the cdrdv2 link does not work, go to the main site at &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.intel.com&#x2F;content&#x2F;www&#x2F;us&#x2F;en&#x2F;developer&#x2F;articles&#x2F;technical&#x2F;intel-sdm.htm&quot;&gt;https:&#x2F;&#x2F;www.intel.com&#x2F;content&#x2F;www&#x2F;us&#x2F;en&#x2F;developer&#x2F;articles&#x2F;technical&#x2F;intel-sdm.htm&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;29&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;26&lt;&#x2F;sup&gt;
&lt;p&gt;Kanter, David. “Intel’s Sandy Bridge Microarchitecture.”
&lt;em&gt;Real World Technologies&lt;&#x2F;em&gt;, September 3, 2010. &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.realworldtech.com&#x2F;sandy-bridge&#x2F;&quot;&gt;https:&#x2F;&#x2F;www.realworldtech.com&#x2F;sandy-bridge&#x2F;&lt;&#x2F;a&gt;. Accessed January 15, 2026.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;30&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;27&lt;&#x2F;sup&gt;
&lt;p&gt;Murph, Darren. “Leaked Intel slides reveal 8-core CPUs, AVX
instruction set.” &lt;em&gt;Engadget&lt;&#x2F;em&gt;, August 16, 2008. &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.engadget.com&#x2F;2008-08-16-leaked-intel-slides-reveal-8-core-cpus-avx-instruction-set.html&quot;&gt;https:&#x2F;&#x2F;www.engadget.com&#x2F;2008-08-16-leaked-intel-slides-reveal-8-core-cpus-avx-instruction-set.html&lt;&#x2F;a&gt;. Accessed January 15, 2026.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;31&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;28&lt;&#x2F;sup&gt;
&lt;p&gt;Intel Corporation. &lt;em&gt;Intel AVX Programming Reference&lt;&#x2F;em&gt;.
Order Number 319433-004, December 2008. &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;kib.kiev.ua&#x2F;x86docs&#x2F;Intel&#x2F;ISAFuture&#x2F;319433-004.pdf&quot;&gt;https:&#x2F;&#x2F;kib.kiev.ua&#x2F;x86docs&#x2F;Intel&#x2F;ISAFuture&#x2F;319433-004.pdf&lt;&#x2F;a&gt;. Accessed January 15, 2026.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;32&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;29&lt;&#x2F;sup&gt;
&lt;p&gt;Intel Corporation. &lt;em&gt;Intel® Advanced Vector Extensions 512&lt;&#x2F;em&gt;. &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.intel.com&#x2F;content&#x2F;www&#x2F;us&#x2F;en&#x2F;architecture-and-technology&#x2F;avx-512-overview.html&quot;&gt;https:&#x2F;&#x2F;www.intel.com&#x2F;content&#x2F;www&#x2F;us&#x2F;en&#x2F;architecture-and-technology&#x2F;avx-512-overview.html&lt;&#x2F;a&gt;. Accessed February 18, 2026.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;33&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;30&lt;&#x2F;sup&gt;
&lt;p&gt;AMD Corporation. “Bulldozer Microarchitecture Technical
Documentation.” Santa Clara, CA: AMD, 2010. &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.amd.com&#x2F;content&#x2F;dam&#x2F;amd&#x2F;en&#x2F;documents&#x2F;archived-tech-docs&#x2F;programmer-references&#x2F;43479.pdf&quot;&gt;https:&#x2F;&#x2F;www.amd.com&#x2F;content&#x2F;dam&#x2F;amd&#x2F;en&#x2F;documents&#x2F;archived-tech-docs&#x2F;programmer-references&#x2F;43479.pdf&lt;&#x2F;a&gt;. Accessed February 18, 2026.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;34&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;51&lt;&#x2F;sup&gt;
&lt;p&gt;Kanter, David. “AMD’s Bulldozer Microarchitecture.”
&lt;em&gt;Real World Technologies&lt;&#x2F;em&gt;, August 26, 2010. &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.realworldtech.com&#x2F;bulldozer&#x2F;&quot;&gt;https:&#x2F;&#x2F;www.realworldtech.com&#x2F;bulldozer&#x2F;&lt;&#x2F;a&gt;. Accessed January 15, 2026.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;35&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;31&lt;&#x2F;sup&gt;
&lt;p&gt;Advanced Micro Devices. “AMD64 Architecture Programmer’s Manual Volume 6: 128-Bit and 256-Bit XOP, FMA4 and CVT16 Instructions”. May 2009. &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;kib.kiev.ua&#x2F;x86docs&#x2F;AMD&#x2F;AMD64&#x2F;43479_APM_v6-r3.03.pdf&quot;&gt;https:&#x2F;&#x2F;kib.kiev.ua&#x2F;x86docs&#x2F;AMD&#x2F;AMD64&#x2F;43479_APM_v6-r3.03.pdf&lt;&#x2F;a&gt;. Accessed February 18, 2026.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;36&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;32&lt;&#x2F;sup&gt;
&lt;p&gt;Kanter, David. “The FMA3 vs FMA4 myth.”
&lt;em&gt;Real World Technologies&lt;&#x2F;em&gt;, December 19, 2011. &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.realworldtech.com&#x2F;forum&#x2F;?threadid=119333&amp;amp;curpostid=119333&quot;&gt;https:&#x2F;&#x2F;www.realworldtech.com&#x2F;forum&#x2F;?threadid=119333&amp;amp;curpostid=119333&lt;&#x2F;a&gt;. Accessed January 15, 2026.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;37&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;33&lt;&#x2F;sup&gt;
&lt;p&gt;Wikipedia. “Xeon Phi”. &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;web.archive.org&#x2F;web&#x2F;2&#x2F;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Xeon_Phi&quot;&gt;https:&#x2F;&#x2F;web.archive.org&#x2F;web&#x2F;2&#x2F;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Xeon_Phi&lt;&#x2F;a&gt;. Archived February 16, 2026.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;38&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;34&lt;&#x2F;sup&gt;
&lt;p&gt;Intel Corporation. “Knights Corner: Your Path to Knights Landing.”
&lt;em&gt;Intel Press Release&lt;&#x2F;em&gt;, September 17, 2014. &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.intel.com&#x2F;content&#x2F;dam&#x2F;develop&#x2F;external&#x2F;us&#x2F;en&#x2F;documents&#x2F;knights-corner-is-your-path-to-knights-landing.pdf&quot;&gt;https:&#x2F;&#x2F;www.intel.com&#x2F;content&#x2F;dam&#x2F;develop&#x2F;external&#x2F;us&#x2F;en&#x2F;documents&#x2F;knights-corner-is-your-path-to-knights-landing.pdf&lt;&#x2F;a&gt;. Accessed February 16, 2026.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;39&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;35&lt;&#x2F;sup&gt;
&lt;p&gt;Downs, Travis. “Gathering Intel on Intel AVX-512 Transitions.”
&lt;em&gt;Performance Matters Blog&lt;&#x2F;em&gt;, January 17, 2020. &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;travisdowns.github.io&#x2F;blog&#x2F;2020&#x2F;01&#x2F;17&#x2F;avxfreq1.html&quot;&gt;https:&#x2F;&#x2F;travisdowns.github.io&#x2F;blog&#x2F;2020&#x2F;01&#x2F;17&#x2F;avxfreq1.html&lt;&#x2F;a&gt;. Accessed January 15, 2026.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;40&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;36&lt;&#x2F;sup&gt;
&lt;p&gt;Torvalds, Linus. “Alder Lake and AVX-512”.
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.realworldtech.com&#x2F;forum&#x2F;?threadid=193189&amp;amp;curpostid=193190&quot;&gt;https:&#x2F;&#x2F;www.realworldtech.com&#x2F;forum&#x2F;?threadid=193189&amp;amp;curpostid=193190&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;41&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;38&lt;&#x2F;sup&gt;
&lt;p&gt;Munroe, Randall. “Standards.” &lt;em&gt;xkcd&lt;&#x2F;em&gt;, July 14, 2012.
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;xkcd.com&#x2F;927&#x2F;&quot;&gt;https:&#x2F;&#x2F;xkcd.com&#x2F;927&#x2F;&lt;&#x2F;a&gt;. Accessed January 15, 2026.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;42&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;37&lt;&#x2F;sup&gt;
&lt;p&gt;Alcorn, Paul. “Intel Nukes Alder Lake’s AVX-512 Support, Now
Fuses It Off in Silicon.” &lt;em&gt;Tom’s Hardware&lt;&#x2F;em&gt;, March 2, 2022. &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.tomshardware.com&#x2F;news&#x2F;intel-nukes-alder-lake-avx-512-now-fuses-it-off-in-silicon&quot;&gt;https:&#x2F;&#x2F;www.tomshardware.com&#x2F;news&#x2F;intel-nukes-alder-lake-avx-512-now-fuses-it-off-in-silicon&lt;&#x2F;a&gt;. Accessed January 15, 2026.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;43&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;39&lt;&#x2F;sup&gt;
&lt;p&gt;Killian, Zak. “Intel Starts Fusing Off AVX-512 In Alder Lake
Silicon To Thwart BIOS Workarounds.” &lt;em&gt;HotHardware&lt;&#x2F;em&gt;, March 3, 2022.
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;hothardware.com&#x2F;news&#x2F;intel-fusing-avx-512-alder-lake-silicon&quot;&gt;https:&#x2F;&#x2F;hothardware.com&#x2F;news&#x2F;intel-fusing-avx-512-alder-lake-silicon&lt;&#x2F;a&gt;. Accessed January 15, 2026.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;45&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;40&lt;&#x2F;sup&gt;
&lt;p&gt;AMD Corporation. &lt;em&gt;4TH GEN AMD EPYC PROCESSOR ARCHITECTURE&lt;&#x2F;em&gt;. Third Edition September 2023. &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.amd.com&#x2F;content&#x2F;dam&#x2F;amd&#x2F;en&#x2F;documents&#x2F;products&#x2F;epyc&#x2F;4th-gen-epyc-processor-architecture-white-paper.pdf&quot;&gt;https:&#x2F;&#x2F;www.amd.com&#x2F;content&#x2F;dam&#x2F;amd&#x2F;en&#x2F;documents&#x2F;products&#x2F;epyc&#x2F;4th-gen-epyc-processor-architecture-white-paper.pdf&lt;&#x2F;a&gt;. Accessed February 18, 2026.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;46&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;41&lt;&#x2F;sup&gt;
&lt;p&gt;Larabel, Michael. “AMD Zen 4 AVX-512 Performance Analysis On
The Ryzen 9 7950X.” &lt;em&gt;Phoronix&lt;&#x2F;em&gt;, September 26, 2022. &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.phoronix.com&#x2F;review&#x2F;amd-zen4-avx512&quot;&gt;https:&#x2F;&#x2F;www.phoronix.com&#x2F;review&#x2F;amd-zen4-avx512&lt;&#x2F;a&gt;. Accessed January 15, 2026.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;p&gt;See also: Mysticial. “Zen4’s AVX512 Teardown.” &lt;em&gt;MersenneForum&lt;&#x2F;em&gt;, September 26, 2022. &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.mersenneforum.org&#x2F;node&#x2F;21615&quot;&gt;https:&#x2F;&#x2F;www.mersenneforum.org&#x2F;node&#x2F;21615&lt;&#x2F;a&gt;. Accessed January 15, 2026.&lt;&#x2F;p&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;47&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;42&lt;&#x2F;sup&gt;
&lt;p&gt;James, Dave. “Intel defends AVX-512 against Torvalds’
criticism.” &lt;em&gt;PC Gamer&lt;&#x2F;em&gt;, August 20, 2020. &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.pcgamer.com&#x2F;intel-defends-avx-512-against-torvalds&#x2F;&quot;&gt;https:&#x2F;&#x2F;www.pcgamer.com&#x2F;intel-defends-avx-512-against-torvalds&#x2F;&lt;&#x2F;a&gt;. Accessed January 15, 2026.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;48&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;43&lt;&#x2F;sup&gt;
&lt;p&gt;Intel Corporation. “Intel Advanced Vector Extensions 10.2”, July 11, 2024. &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;cdrdv2-public.intel.com&#x2F;836199&#x2F;361050-intel-avx10.2-spec.pdf&quot;&gt;https:&#x2F;&#x2F;cdrdv2-public.intel.com&#x2F;836199&#x2F;361050-intel-avx10.2-spec.pdf&lt;&#x2F;a&gt;. Accessed February 18, 2025.
Note: if the cdrdv2 link does not work, go to the main site at &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.intel.com&#x2F;content&#x2F;www&#x2F;us&#x2F;en&#x2F;developer&#x2F;articles&#x2F;technical&#x2F;intel-sdm.htm&quot;&gt;https:&#x2F;&#x2F;www.intel.com&#x2F;content&#x2F;www&#x2F;us&#x2F;en&#x2F;developer&#x2F;articles&#x2F;technical&#x2F;intel-sdm.htm&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;p&gt;See also: “Intel Officially Confirms AVX10.2 and APX Support in Nova Lake.”
&lt;em&gt;TechPowerUp&lt;&#x2F;em&gt;, November 13, 2025. &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.techpowerup.com&#x2F;342881&#x2F;intel-officially-confirms-avx10-2-and-apx-support-in-nova-lake&#x2F;&quot;&gt;https:&#x2F;&#x2F;www.techpowerup.com&#x2F;342881&#x2F;intel-officially-confirms-avx10-2-and-apx-support-in-nova-lake&#x2F;&lt;&#x2F;a&gt;. Accessed January 16, 2026.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;additional-technical-references&quot;&gt;Additional Technical References&lt;&#x2F;h2&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;49&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;44&lt;&#x2F;sup&gt;
&lt;p&gt;Intel Corporation. &lt;em&gt;Intel 64 and IA-32 Architectures Optimization Reference Manual&lt;&#x2F;em&gt;. Order Number 248966, April 2024. &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;cdrdv2-public.intel.com&#x2F;814198&#x2F;248966-Optimization-Reference-Manual-V1-050.pdf&quot;&gt;https:&#x2F;&#x2F;cdrdv2-public.intel.com&#x2F;814198&#x2F;248966-Optimization-Reference-Manual-V1-050.pdf&lt;&#x2F;a&gt;. Accessed January 16, 2026.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;50&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;52&lt;&#x2F;sup&gt;
&lt;p&gt;June 20, 2017. &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.intel.com&#x2F;content&#x2F;www&#x2F;us&#x2F;en&#x2F;developer&#x2F;articles&#x2F;technical&#x2F;intel-avx-512-instructions.html&quot;&gt;https:&#x2F;&#x2F;www.intel.com&#x2F;content&#x2F;www&#x2F;us&#x2F;en&#x2F;developer&#x2F;articles&#x2F;technical&#x2F;intel-avx-512-instructions.html&lt;&#x2F;a&gt;. Accessed January 16, 2026.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;51&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;53&lt;&#x2F;sup&gt;
&lt;p&gt;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.intel.com&#x2F;content&#x2F;www&#x2F;us&#x2F;en&#x2F;docs&#x2F;intrinsics-guide&#x2F;index.html&quot;&gt;https:&#x2F;&#x2F;www.intel.com&#x2F;content&#x2F;www&#x2F;us&#x2F;en&#x2F;docs&#x2F;intrinsics-guide&#x2F;index.html&lt;&#x2F;a&gt;. Accessed January 16, 2026.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;52&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;45&lt;&#x2F;sup&gt;
&lt;p&gt;Fog, Agner. “Optimizing subroutines in assembly language.” &lt;em&gt;Agner Fog’s Optimization Manuals&lt;&#x2F;em&gt;, 2024. &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.agner.org&#x2F;optimize&#x2F;&quot;&gt;https:&#x2F;&#x2F;www.agner.org&#x2F;optimize&#x2F;&lt;&#x2F;a&gt;. Accessed January 15, 2026.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;53&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;46&lt;&#x2F;sup&gt;
&lt;p&gt;System V Application Binary Interface - Intel386 Architecture Processor Supplement. &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;gitlab.com&#x2F;x86-psABIs&#x2F;x86-64-ABI&quot;&gt;https:&#x2F;&#x2F;gitlab.com&#x2F;x86-psABIs&#x2F;x86-64-ABI&lt;&#x2F;a&gt;. Accessed January 15, 2026.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;54&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;47&lt;&#x2F;sup&gt;
&lt;p&gt;Henderson, Richard. GCC Patches mailing list, December 2004. &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;gcc.gnu.org&#x2F;legacy-ml&#x2F;gcc-patches&#x2F;2004-12&#x2F;msg01955.html&quot;&gt;https:&#x2F;&#x2F;gcc.gnu.org&#x2F;legacy-ml&#x2F;gcc-patches&#x2F;2004-12&#x2F;msg01955.html&lt;&#x2F;a&gt;. Accessed January 15, 2026.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;55&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;48&lt;&#x2F;sup&gt;
&lt;p&gt;Intel Corporation. “Intel Architecture Optimization Reference Manual.” Order Number 245127-001, 1999.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;56&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;49&lt;&#x2F;sup&gt;
&lt;p&gt;Naishlos, Dorit, et al. “Autovectorization in GCC.” &lt;em&gt;GCC Summit 2004&lt;&#x2F;em&gt;. &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;gcc.gnu.org&#x2F;pub&#x2F;gcc&#x2F;summit&#x2F;2004&#x2F;Autovectorization.pdf&quot;&gt;https:&#x2F;&#x2F;gcc.gnu.org&#x2F;pub&#x2F;gcc&#x2F;summit&#x2F;2004&#x2F;Autovectorization.pdf&lt;&#x2F;a&gt;. Accessed January 15, 2026.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
</content>
        
    </entry>
    <entry xml:lang="en">
        <title>Privacy Policy</title>
        <published>2026-01-16T00:00:00+00:00</published>
        <updated>2026-01-16T00:00:00+00:00</updated>
        
        <author>
          <name>
            
              Unknown
            
          </name>
        </author>
        
        <link rel="alternate" type="text/html" href="https://bgslabs.org/privacy/"/>
        <id>https://bgslabs.org/privacy/</id>
        
        <content type="html" xml:base="https://bgslabs.org/privacy/">&lt;h1 id=&quot;privacy-policy&quot;&gt;Privacy Policy&lt;&#x2F;h1&gt;
&lt;p&gt;Last updated: January 2026&lt;&#x2F;p&gt;
&lt;h2 id=&quot;data-collection&quot;&gt;Data Collection&lt;&#x2F;h2&gt;
&lt;p&gt;This website uses Google Analytics 4 to understand how visitors engage with our content. The following data is collected:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Pages visited&lt;&#x2F;li&gt;
&lt;li&gt;Time spent on pages&lt;&#x2F;li&gt;
&lt;li&gt;Referrer (how you found us)&lt;&#x2F;li&gt;
&lt;li&gt;Device&#x2F;browser type&lt;&#x2F;li&gt;
&lt;li&gt;Approximate location (country&#x2F;region)&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h2 id=&quot;cookies&quot;&gt;Cookies&lt;&#x2F;h2&gt;
&lt;p&gt;We use a single cookie to remember your privacy preferences:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;BGsLab_consent&lt;&#x2F;code&gt;: Stores your choice (accept&#x2F;reject) for analytics&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h2 id=&quot;third-party-services&quot;&gt;Third-Party Services&lt;&#x2F;h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Google Analytics 4&lt;&#x2F;strong&gt;: Visitor statistics&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;GitHub Pages&lt;&#x2F;strong&gt;: Hosting infrastructure&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h2 id=&quot;your-rights&quot;&gt;Your Rights&lt;&#x2F;h2&gt;
&lt;p&gt;Under GDPR, you have the right to:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Request access to your data (we don’t store personal data)&lt;&#x2F;li&gt;
&lt;li&gt;Request deletion of your data (clear your browser localStorage)&lt;&#x2F;li&gt;
&lt;li&gt;Opt out of tracking (use the “Reject” button in our cookie banner)&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h2 id=&quot;contact&quot;&gt;Contact&lt;&#x2F;h2&gt;
&lt;p&gt;For questions about this policy:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Email: &lt;a href=&quot;mailto:burakgungor11235@gmail.com&quot;&gt;burakgungor11235@gmail.com&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
</content>
        
    </entry>
    <entry xml:lang="en">
        <title>varm&#x2F;vasm</title>
        <published>2026-01-03T00:00:00+00:00</published>
        <updated>2026-01-03T00:00:00+00:00</updated>
        
        <author>
          <name>
            
              Unknown
            
          </name>
        </author>
        
        <link rel="alternate" type="text/html" href="https://bgslabs.org/projects/varm-vasm/"/>
        <id>https://bgslabs.org/projects/varm-vasm/</id>
        
        <content type="html" xml:base="https://bgslabs.org/projects/varm-vasm/">&lt;p&gt;varm and vasm represent a deep dive into systems programming, providing a low-level execution environment inspired by ARMv7.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;key-features&quot;&gt;Key Features&lt;&#x2F;h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Zero Garbage Collection&lt;&#x2F;strong&gt;: Direct memory management for performance and predictability.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Cross-platform Assembly&lt;&#x2F;strong&gt;: Assembly code that runs anywhere C99 is supported.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Advanced Concurrency&lt;&#x2F;strong&gt;: Integrated support for both multi-threading and green threading.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;System Capabilities&lt;&#x2F;strong&gt;: Built-in syscalls, networking primitives, and a comprehensive standard library.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;blockquote&gt;
&lt;p&gt;Note: this project isn’t finished and is under heavy development. What’s written here is just my plan for it.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
</content>
        
    </entry>
    <entry xml:lang="en">
        <title>Arith</title>
        <published>2025-08-25T00:00:00+00:00</published>
        <updated>2025-08-25T00:00:00+00:00</updated>
        
        <author>
          <name>
            
              Unknown
            
          </name>
        </author>
        
        <link rel="alternate" type="text/html" href="https://bgslabs.org/projects/arith/"/>
        <id>https://bgslabs.org/projects/arith/</id>
        
        <content type="html" xml:base="https://bgslabs.org/projects/arith/">&lt;p&gt;Arith is a deep dive into compiler design, implementing a full interpretation pipeline from lexical analysis and parsing to bytecode compilation and execution on a custom stack-based virtual machine.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;architectural-achievements&quot;&gt;Architectural Achievements&lt;&#x2F;h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Hand-written Lexer &amp;amp; Pratt Parser&lt;&#x2F;strong&gt;: Built from first principles to construct a robust Abstract Syntax Tree (AST).&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Two-Phase Execution Engine&lt;&#x2F;strong&gt;: Lowers the AST into a custom bytecode instruction set (IR) for better performance than a simple tree-walking interpreter.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Bytecode &amp;amp; VM&lt;&#x2F;strong&gt;: A stack-based virtual machine (VM) executes the compiled instructions, demonstrating a sophisticated understanding of language implementation.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Precise Error Handling&lt;&#x2F;strong&gt;: A multi-layered system provides user-friendly feedback with exact line and column numbers.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
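&lt;p&gt;To make the pipeline above concrete, here is a minimal sketch, written in Python rather than Rust. It is not Arith’s actual code: the opcode names, the nested-tuple AST, and the &lt;code&gt;compile_ast&lt;&#x2F;code&gt; and &lt;code&gt;run&lt;&#x2F;code&gt; helpers are all assumptions made up for illustration. It only shows the general idea of lowering an AST to bytecode and executing it on a stack.&lt;&#x2F;p&gt;

```python
# Hypothetical opcodes for illustration; Arith's real instruction set is not shown here.
PUSH, ADD, SUB, MUL, DIV = "PUSH", "ADD", "SUB", "MUL", "DIV"

BINARY_OPS = {
    ADD: lambda a, b: a + b,
    SUB: lambda a, b: a - b,
    MUL: lambda a, b: a * b,
    DIV: lambda a, b: a / b,
}


def compile_ast(node):
    """Lower a nested-tuple AST into a flat list of (opcode, operand) pairs."""
    if isinstance(node, (int, float)):
        return [(PUSH, node)]
    op, left, right = node
    # Post-order walk: emit both operands first, then the operator.
    return compile_ast(left) + compile_ast(right) + [(op, None)]


def run(bytecode):
    """Execute bytecode on an operand stack and return the final value."""
    stack = []
    for opcode, operand in bytecode:
        if opcode == PUSH:
            stack.append(operand)
        else:
            right = stack.pop()
            left = stack.pop()
            stack.append(BINARY_OPS[opcode](left, right))
    return stack.pop()


# AST for 1 + 2 * 3, the shape a parser would produce with the usual precedence.
ast = (ADD, 1, (MUL, 2, 3))
program = compile_ast(ast)
result = run(program)
```

&lt;p&gt;Here &lt;code&gt;program&lt;&#x2F;code&gt; is a flat list of five instructions and &lt;code&gt;result&lt;&#x2F;code&gt; is 7, because the multiplication is emitted before the addition.&lt;&#x2F;p&gt;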
&lt;p&gt;Built with &lt;strong&gt;Rust&lt;&#x2F;strong&gt; for high performance and memory safety.&lt;&#x2F;p&gt;
</content>
        
    </entry>
    <entry xml:lang="en">
        <title>89crypt: Matrix-Based Encryption</title>
        <published>2025-08-07T00:00:00+00:00</published>
        <updated>2025-08-07T00:00:00+00:00</updated>
        
        <author>
          <name>
            
              Unknown
            
          </name>
        </author>
        
        <link rel="alternate" type="text/html" href="https://bgslabs.org/projects/89crypt/"/>
        <id>https://bgslabs.org/projects/89crypt/</id>
        
        <content type="html" xml:base="https://bgslabs.org/projects/89crypt/">&lt;p&gt;89crypt offers a mathematical approach to symmetric encryption distinct from conventional cryptographic algorithms. It leverages repeating decimal expansions to populate key matrices.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;objective-achievements&quot;&gt;Objective &amp;amp; Achievements&lt;&#x2F;h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Mathematical Innovation&lt;&#x2F;strong&gt;: Uses periodic decimal expansions for matrix-based encryption.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Robust Engineering&lt;&#x2F;strong&gt;: Fully type-hinted Python package with modular design and comprehensive documentation.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Recognition&lt;&#x2F;strong&gt;: Finalist in the I-MAT Project Competition and TUBITAK 2204-A participant.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Open Source&lt;&#x2F;strong&gt;: Published under GPL-3.0 to encourage transparency and collaboration.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;The project demonstrates high-level proficiency in software architecture, mathematical modeling, and cryptographic principles.&lt;&#x2F;p&gt;
</content>
        
    </entry>
    <entry xml:lang="en">
        <title>Ancient Sumerian NMT</title>
        <published>2023-11-06T00:00:00+00:00</published>
        <updated>2023-11-06T00:00:00+00:00</updated>
        
        <author>
          <name>
            
              Unknown
            
          </name>
        </author>
        
        <link rel="alternate" type="text/html" href="https://bgslabs.org/projects/sumerian-nmt/"/>
        <id>https://bgslabs.org/projects/sumerian-nmt/</id>
        
        <content type="html" xml:base="https://bgslabs.org/projects/sumerian-nmt/">&lt;p&gt;This is a research project that addresses the challenge of translating
a low-resource language isolate like Ancient Sumerian into modern Turkish.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;technical-deep-dive&quot;&gt;Technical Deep Dive&lt;&#x2F;h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Transformer Architecture&lt;&#x2F;strong&gt;: Fine-tuned a T5 model using transfer learning to overcome sparse data.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Custom Tokenization&lt;&#x2F;strong&gt;: Developed a tokenizer specifically tailored to Sumerian transliteration.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Complex Data Pipeline&lt;&#x2F;strong&gt;: Engineered a multi-language pipeline (Python, C, Perl) to harmonize data from diverse academic sources (ORACC, CDLI, ETCSL).&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Evaluation&lt;&#x2F;strong&gt;: Rigorous performance testing using BLEU, METEOR, and WER metrics.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
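&lt;p&gt;Of the metrics listed above, WER is the simplest to show in a few lines: it is the word-level edit distance between a reference translation and the model output, divided by the number of reference words. The sketch below is a generic implementation, not the project’s evaluation code; the function name &lt;code&gt;word_error_rate&lt;&#x2F;code&gt; and the sample sentences are assumptions made up for illustration.&lt;&#x2F;p&gt;

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] is the edit distance between the reference words
    # consumed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# Made-up example pair: one substituted word out of five reference words.
score = word_error_rate("the king built the temple", "the king built a temple")
```

&lt;p&gt;For this pair &lt;code&gt;score&lt;&#x2F;code&gt; is 0.2. Lower is better, and a value above 1.0 is possible when the output is much longer than the reference.&lt;&#x2F;p&gt;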
&lt;p&gt;This work combines digital humanities with (used to be) state-of-the-art AI to preserve and understand ancient knowledge.&lt;&#x2F;p&gt;
</content>
        
    </entry>
</feed>
