\documentclass{article}
\usepackage{geometry}
\usepackage{hyperref}
\usepackage{pgfplots}
\usepackage{tikz}
\usepackage{minted}
\usepackage{booktabs}
\usepackage{siunitx}
\usepackage{algorithm2e}
\usepackage{float}
\usepackage{caption}
\usepackage{subcaption}
\usepackage{wrapfig}
\geometry{a4paper}
% ----- Custom commands ------
\newcommand{\code}[1]{\small\mintinline[xleftmargin=2em, xrightmargin=2em,
breaklines]{asm}{#1}}
\newcommand{\snippet}[3]{\inputminted[firstline=#1,lastline=#2,linenos,
xleftmargin=1.5em, breaklines]{asm}{#3}}
\newcommand{\snippetC}[3]{\inputminted[firstline=#1,lastline=#2,linenos,
xleftmargin=1.5em, breaklines]{c}{#3}}
% ----------------------------
\begin{document}
\begin{center}
\pagenumbering{gobble}
\Huge
Sorting in x86\_64 Linux Assembly
\vspace{1cm}
\Large
Navid S.
\vspace{0.5cm}
Andreas K. L.
\vspace{0.5cm}
Mikkel T.
\vspace{1cm}
\large
October 31$^\text{st}$ 2025
\end{center}
\newpage
\tableofcontents
\newpage
\pagenumbering{arabic}
\section{Introduction}
Sorting is one of the most fundamental operations in programming. Sorting
algorithms play a crucial role in efficient searching, organization and
interpretation of data, and optimization problems. Furthermore, sorting is often
the first step in designing more complicated algorithms in computer science.
This project studies efficient sorting of a list of coordinates, sorted by
their second coordinate, implemented solely in \textit{x86\_64 Linux Assembly}.
Although abstract high-level languages such as \textit{Python} and
\textit{Java} allow faster and more convenient development, Assembly provides
precise control over program flow, memory usage and instruction-level
execution, making it an excellent environment for benchmarking and performance
analysis.
Two sorting algorithms, quicksort and insertion sort, will be implemented and
compared against each other. Several test instances consisting of uniformly
random coordinates will be created, with the goal of comparing the
implementations' actual runtimes to their theoretical runtimes and evaluating
each implementation.
\section{Design}
\subsection{Program Flow}
The program is structured into four main phases: selecting an algorithm,
converting the input file to the array format, sorting that array, and printing
the array to stdout. Each phase handles a specific task, which creates
modularity and clarity. Within each phase a set of functions manages the
corresponding operations, allowing the programmer to focus on one task at a
time. This design also greatly supports debugging by making errors easier to
track down and fix. Below is a flowchart visualizing each step, followed by a
short summary of the design.
\begin{figure}[H]
\centering
\includegraphics[width=\linewidth]{Design.drawio.png}
\caption{Program flow}
\label{fig:flowchart}
\end{figure}
\textbf{Parsing the data:} After reading the input, the received file is
parsed, meaning that only useful data is extracted. This in turn significantly
simplifies iteration through the data, by ensuring that only meaningful data is
left. Lastly, an important design choice has been to create an array of
pointers to each coordinate pair. This design is explained in further detail
later.
\textbf{Sorting:} After parsing, a sorting algorithm of choice can be applied
to the coordinates, to sort by a preferred coordinate. The chosen algorithm can
be passed as a command line argument; otherwise a default algorithm is run.
This design, where an algorithm can be passed to the sorting function, allows
the program to be extended with new algorithms if needed.
\textbf{Generating output:} The last step of the program is converting the
parsed data back into a format that can be printed.
\subsection{Input and Output format}
Each line contains one and only one coordinate consisting of two values
formatted as \texttt{$\langle$x$\rangle$\textbackslash
t$\langle$y$\rangle$\textbackslash n}, where \texttt{x} and \texttt{y} are
non-negative integers in the range 0 to 32767. Although limiting, this range is
a deliberate design decision, made to avoid unnecessary complexity and keep the
focus on the main objective of the project: efficient sorting. Including
negative numbers would, for example, require extra logic for parsing,
comparison, converting back and forth to ASCII, and allocating memory. Both
input and output follow this strict format.
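\smallskip
\noindent
As a concrete illustration (the values here are made up), a three-line input
file, with the escape sequences written out explicitly, would contain:
\begin{minted}{text}
12\t345\n7\t0\n32767\t9\n
\end{minted}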
\subsection{Selecting algorithms}
The ability to choose which algorithm to use can be quite beneficial if you
know certain aspects of your data. Therefore, a design choice has been to let
the user select an algorithm as a command line argument, rather than
hard-coding it. This decision is great for flexibility, and also supports
future extensions of the program.
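\smallskip
\noindent
As a minimal sketch of how such an argument can be picked up (AT\&T syntax; the
labels are illustrative and not taken from the actual implementation), the
process entry point on Linux finds \texttt{argc} and \texttt{argv} on the
stack:
\begin{minted}{asm}
_start:
    movq  (%rsp), %rax        # argc: number of command line arguments
    cmpq  $2, %rax            # was an algorithm name passed?
    jl    use_default         # no: fall back to the default algorithm
    movq  16(%rsp), %rsi      # argv[1]: pointer to the algorithm name
    # ... compare the string against the known algorithm names ...
use_default:
    # ... select the default algorithm ...
\end{minted}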
\subsection{Why Insertionsort and Quicksort?}
Insertion sort and quicksort have been handpicked to research the trade-offs
between algorithmic simplicity and computational efficiency. Furthermore,
these algorithms are widely different in nature. Insertion sort is an iterative
algorithm, depending primarily on memory usage, registers, and program flow,
while quicksort is a recursive algorithm using the divide-and-conquer approach.
This is interesting because quicksort is expected to excel on large random
datasets, while insertion sort is more suitable for small or nearly sorted
cases, which is another great motivation to understand what is really
happening when these algorithms are executed, and how the dataset affects them.
\begin{wrapfigure}[15]{r}{0.5\textwidth}
\centering
\begin{tikzpicture}
\draw (0,0) rectangle (4,1); % outer box
\node at (-0.5,0.5) {$A=$};
% vertical dividers
\draw (1,0) -- (1,1);
\draw (2,0) -- (2,1);
\draw (3,0) -- (3,1);
\node at (0.5,0.5) {$p_1$};
\node at (1.5,0.5) {$p_2$};
\node at (2.5,0.5) {$\cdots$};
\node at (3.5,0.5) {$p_n$};
\draw (0.1,2) rectangle (0.8,5);
\node at (0.5,2.5) {$e_{11}$};
\node at (0.45,3.6) {$\vdots$};
\node at (0.5,4.5) {$e_{1m}$};
\draw (1.1,2) rectangle (1.8,5);
\node at (1.5,2.5) {$e_{21}$};
\node at (1.45,3.6) {$\vdots$};
\node at (1.5,4.5) {$e_{2m}$};
\draw (3.1,2) rectangle (3.8,5);
\node at (3.5,2.5) {$e_{n1}$};
\node at (3.45,3.6) {$\vdots$};
\node at (3.5,4.5) {$e_{nm}$};
\draw[->] (0.5,1.1) -- (0.5,1.9);
\draw[->] (1.5,1.1) -- (1.5,1.9);
\draw[->] (3.5,1.1) -- (3.5,1.9);
\end{tikzpicture}
\caption{The format of the array $A$, where $p_n$ is the $n^\text{th}$
memory address, pointing to an array where $e_{nm}$ is the $m^\text{th}$
element of the $n^\text{th}$ array in $A$.}
\label{fig:array}
\end{wrapfigure}
\subsection{Sorting Format}
In order to be a little more general, and to make our program easily extensible
to different types of data, we wanted to create our own array datatype that
allows us to sort something more abstract, with respect to a key in that
abstract object.
\smallskip
\noindent
To do this we settled on an array of memory addresses, that
point to where the abstract objects are, see \autoref{fig:array}.
\smallskip
\noindent
These objects would in practice also be represented as arrays.
With this we can specify which index (key) in the sub-arrays we want to sort by,
and then when we swap elements in the array,
we simply swap the memory addresses around.
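\smallskip
\noindent
As a minimal sketch of this idea (AT\&T syntax; the register assignment is
illustrative, not the project's actual code), swapping two elements of $A$ only
moves the two 8-byte addresses:
\begin{minted}{asm}
# Swap A[i] and A[j]; %rdi = A, %rsi = i, %rdx = j.
# Each slot holds an 8-byte pointer, hence the scale factor 8.
movq (%rdi,%rsi,8), %rax   # rax = A[i]
movq (%rdi,%rdx,8), %rcx   # rcx = A[j]
movq %rcx, (%rdi,%rsi,8)   # A[i] = old A[j]
movq %rax, (%rdi,%rdx,8)   # A[j] = old A[i]
\end{minted}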
\section{Implementation}
This four phase design has been converted into Assembly code using the x86\_64
instruction set. Furthermore the design has been split up in modules, where each
module is a collection of closely related functions. For example
\textit{array\_maker.s} contains functions meant to parse data and convert it to
the array format. All functions in this program follow the x86\_64 System V
calling convention, with a few exceptions where all registers are preserved.
Below is a detailed explanation of some of the more interesting functions.
\subsection{Array maker}
We implemented a global function \texttt{array\_maker} that takes a
\textit{file descriptor} as input, and returns a pointer to the start of the
array together with the number of elements. It first calls the function
\texttt{create\_file\_buffer}, which reads the contents of the file into
memory. It then passes this address and the length of the file on to the
function \texttt{make\_pairwise\_data}. This function removes
\texttt{\textbackslash t} and \texttt{\textbackslash n}, and puts the integers
side-by-side, with space for 64-bit integers. This format is nice because it
allows us to iterate over the number of coordinates and select every pair $i$
by going to position $2i$, using that as the place in memory where the
$i^\text{th}$ pair in the array points to. The pseudocode for this can be seen
in \autoref{fig:arraymaker}.\footnote{The assembly code is straightforward to
derive from this pseudocode, hence pseudocode is shown instead of assembly.
This also makes it easier to understand.}
\begin{algorithm}[H]
\DontPrintSemicolon % optional: removes ;
\SetAlgoNoLine
\SetKwInOut{Input}{Input}
\SetKwInOut{Output}{Output}
\Input{File descriptor $f$}
\Output{Pointer and size of array $A$}
\BlankLine
\texttt{// Pointer to file data in $p$ and length of that data in $n$.}\\
$p,n\leftarrow$ create\_file\_buffer($f$)\\
\texttt{// Pointer to pairwise data in $p$ and number of pairs in $n$}\\
$p,n\leftarrow$ make\_pairwise\_data($p,n$)\\
\texttt{// Allocate space for final array, and save pointer in $A$}\\
$A\leftarrow$ allocate($8n$)\\
\texttt{// Put memory address of pair $i$ at index $i$ in $A$}\\
\For{$i\leftarrow 0$ \KwTo $n-1$}{
$A[i]\leftarrow$ mem\_addr($p[2i]$)
}
\Return $A$ and $n$
\caption{Pseudocode for \texttt{array\_maker}}\label{fig:arraymaker}
\end{algorithm}
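\noindent
The final loop of the pseudocode maps to only a handful of instructions. A
minimal sketch (AT\&T syntax; the register choices are illustrative and assume
$n>0$):
\begin{minted}{asm}
# Fill A with the addresses of the n pairs.
# %rdi = A, %rsi = p (pairwise data), %rcx = n
fill_loop:
    movq %rsi, (%rdi)      # A[i] = address of the i-th pair
    addq $16, %rsi         # skip one pair: two 8-byte integers
    addq $8,  %rdi         # advance to the next slot of A
    decq %rcx
    jnz  fill_loop
\end{minted}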
\newpage
\begin{wrapfigure}[22]{r}{0.6\textwidth}
\snippetC{1}{22}{qsort.c}
\caption{Pseudocode for the quicksort algorithm in C style}\label{fig:qsort}
\end{wrapfigure}
\subsection{Quicksort}
The quicksort implementation uses the \textbf{Hoare partition scheme},
a \textit{two-way partitioning} approach where two pointers scan from opposite
ends of the array toward the middle.
\smallskip
\noindent
The left pointer advances rightward while
pointing at elements smaller than the \textit{pivot}, and the right pointer
advances leftward while pointing at elements larger than the \textit{pivot}.
When both pointers point at misplaced elements, the elements are swapped.
\smallskip
\noindent
This continues until the pointers cross, effectively partitioning the array
into two segments.
\medskip
\noindent
In order to compare two values in the array, the algorithm
\textit{dereferences} the sub-arrays and accesses the element at the given
\textit{index}.
\smallskip
\noindent
To swap two \textit{sub-arrays}, it simply swaps their \textit{pointers}
in the pointer-array.
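\smallskip
\noindent
As a minimal sketch of the comparison step (AT\&T syntax; the register use and
branch target are illustrative, not the project's actual code), comparing the
key at index 1 of pairs $i$ and $j$ amounts to two dereferences:
\begin{minted}{asm}
# Compare the keys of A[i] and A[j] (key at index 1, byte offset 8).
# %rdi = A, %rsi = i, %rdx = j
movq (%rdi,%rsi,8), %rax   # rax = pointer to pair i
movq (%rdi,%rdx,8), %rcx   # rcx = pointer to pair j
movq 8(%rax), %rax         # rax = key (y-coordinate) of pair i
cmpq 8(%rcx), %rax         # compare with the key of pair j
jl   key_i_smaller         # branch if pair i's key is smaller
\end{minted}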
\newpage
\subsection{Integer to ASCII}
\begin{wrapfigure}[11]{r}{0.4\textwidth}
\begin{tabular}{|c|c|c|c|c|}
\hline
\multicolumn{5}{|c|}{ASCII table :: Numeric Glyphs} \\
\hline
0 & 1 & 2 & 3 & 4 \\
\hline
$48_{10}$ & $49_{10}$ &$ 50_{10}$ & $51_{10}$ &$ 52_{10}$ \\
$30_{16}$ & $31_{16}$ & $32_{16}$ & $33_{16}$ & $34_{16}$ \\
\hline 5 & 6 & 7 & 8 & 9 \\
\hline
$53_{10}$ &$ 54_{10}$ & $55_{10}$ &$ 56_{10}$ & $57_{10}$ \\
$35_{16}$ & $36_{16}$ & $37_{16}$ & $38_{16}$ & $39_{16}$ \\
\hline
\end{tabular}
\caption{Table over numeric ASCII codes}\label{fig:ascii-table}
\end{wrapfigure}
When converting each coordinate back into \textit{ASCII},
we use repeated \textbf{Euclidean division} (dividing by 10 and keeping the
remainder) to extract each digit, but each extracted digit is still in binary.
\smallskip
\noindent
To convert a binary digit to its \textit{ASCII} character,
we simply add the \textit{ASCII} code for \texttt{0},
which is \texttt{48} in decimal and \texttt{0x30} in hexadecimal.
This works because the digit glyphs appear consecutively in the \textit{ASCII}
table.
\smallskip
\noindent
But the algorithm has the downside of producing the digits in \textit{reverse}
order (least significant first), which means we need to start at the right side
of the \textit{buffer} when adding new digits, so as not to \textit{overwrite}
previous digits.
\smallskip
\noindent
So the digits afterwards need to be shifted down to the \textit{leftmost}
position in the buffer.
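\smallskip
\noindent
A minimal sketch of the digit-extraction loop (AT\&T syntax; the label names
and register use are illustrative, not the project's actual code):
\begin{minted}{asm}
# Convert the unsigned integer in %rdi to ASCII digits.
# %rsi points one past the end of the buffer; digits are written
# right-to-left. Returns a pointer to the first digit in %rax.
int_to_ascii:
    movq  %rdi, %rax       # dividend
    movq  $10, %rcx        # divisor
.next_digit:
    xorq  %rdx, %rdx       # clear upper half for unsigned division
    divq  %rcx             # rax = rax / 10, rdx = remainder
    addq  $48, %rdx        # add the ASCII code of '0' (0x30)
    decq  %rsi
    movb  %dl, (%rsi)      # store the digit byte
    testq %rax, %rax
    jnz   .next_digit
    movq  %rsi, %rax       # pointer to the most significant digit
    ret
\end{minted}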
\bigskip
\noindent
To do this we used \texttt{REP MOVSB}, a string operation in the
\textit{x86 instruction set architecture} that performs a
\textit{repeated byte copy}.
\noindent
\begin{wrapfigure}[8]{l}{0.55\textwidth}
\begin{tabular}{|c|c|}
\hline
\multicolumn{2}{|c|}{\texttt{REP MOVSB} :: Operands Overview} \\
\hline
Register/Input & Purpose \\
\hline
\texttt{RSI} & Source memory address \\
\texttt{RDI} & Destination memory address \\
\texttt{RCX} & Repeat Count \\
\texttt{DF} & Direction \\
\hline
\end{tabular}
\caption{Table over \texttt{MOVSB} operands}\label{fig:movsb-operands}
\end{wrapfigure}
\noindent
It copies data from the memory location pointed to by
the \textit{source} index register (\texttt{RSI}) to the location pointed to by
the \textit{destination} index register (\texttt{RDI}),
repeating this operation for a number of \textit{iterations} specified by
the count register (\texttt{RCX}).
\vspace{1cm}
\noindent
\texttt{RSI} and \texttt{RDI} are automatically either \textit{incremented} or
\textit{decremented}, based on the direction flag (\texttt{DF});
when the direction flag is \textit{clear}, both pointers \textit{increment},
and when it is \textit{set}, both pointers \textit{decrement}.
\smallskip
\noindent
This \textit{instruction} is particularly useful for bulk memory operations
such as copying strings,
as it encodes a potentially lengthy loop into a \textit{single} instruction,
thereby improving both code density and performance.
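\smallskip
\noindent
A minimal usage sketch (AT\&T syntax; \texttt{src}, \texttt{dst} and
\texttt{len} are illustrative symbols, not the project's actual names):
\begin{minted}{asm}
# Copy len bytes from src to dst, forwards.
cld                        # clear DF: RSI and RDI increment
leaq src(%rip), %rsi       # source address
leaq dst(%rip), %rdi       # destination address
movq $len, %rcx            # number of bytes to copy
rep movsb                  # copy RCX bytes in one instruction
\end{minted}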
\section{Evaluation}
To evaluate whether our program works, we created three \textit{shell scripts},
one for generating test data (\textit{generate\_test\_data.sh}), one for testing
the validity of the output of the program (\textit{test.sh}) and one for testing
the execution time of the program (\textit{benchmark.sh}).
The script that generates test files creates files of every size in
$\{1000n\mid 0\leq n\leq 100\}$, where the size is the number of coordinates.
For each size we also create three different kinds of test files: one where all
of the data is random, one where the data is already sorted, and one where it
is reversely sorted.\footnote{We also ran it for $n=$ 1 million random
coordinates, but to get a readable plot that was not too sparse, we decided to
omit these values. Quicksort ran for 0.159s and Insertionsort ran for 386.935s,
making Quicksort roughly 2400 times faster than Insertionsort.} This way we
can see both the average case and worst case of any algorithm we decide to use.
Now we can use our \textit{test.sh} script, which runs the program with each of
the generated test files for all sorting algorithms of the program, and
compares the output with that of the system's built-in \textit{sort} utility.
Given that all of the tests pass, we can then run \textit{benchmark.sh}, which
times how long our program takes to run with each test file for each algorithm,
and saves the result in a \textit{csv} file. The results of this \textit{csv}
file, together with functions asymptotically equivalent to $\mathcal O(n)$, $\mathcal O(n\log
n)$ and $\mathcal O(n^2)$, are shown in \autoref{fig:benchmarktime}.
\begin{figure}[H]
\centering
\begin{tikzpicture}
\begin{axis}[
xlabel={size({$n$})},
ylabel={seconds},
ymode=log,
% xmode=log,
width=\textwidth,
legend pos=north west,
clip mode=individual,
mark size=1pt,
scatter/classes={
isort_random={mark=*,red},
isort_reverse={mark=*,magenta},
isort_sorted={mark=*,pink},
qsort_random={mark=*,teal},
qsort_reverse={mark=*,blue},
qsort_sorted={mark=*,cyan}
}
]
% -- Scatter plot --
\addplot[scatter, only marks, scatter src=explicit symbolic] table
[x=size, y=time, meta=type, col sep=comma]
{../test/benchmark_results.csv};
\legend{isort\_random, isort\_reverse, isort\_sorted, qsort\_random,
qsort\_reverse, qsort\_sorted}
% -- O(n)
\addplot[domain=1:100000, samples=100, ultra thick, densely dotted, black]
{(x * 1/18000000+0.0003)};
\addlegendentry{$\mathcal O(n)$}
% -- O(nlogn)
\addplot[domain=1:100000, samples=100, ultra thick, dotted, black]
{(x*ln(x)) * 1/80000000 + 0.0003};
\addlegendentry{$\mathcal O(n\log n)$}
% -- O(n^2)
\addplot[domain=1:100000, samples=100, ultra thick, dashed, black]
{(x*x) * 1/5000000000+0.0003};
\addlegendentry{$\mathcal O(n^2)$}
\end{axis}
\end{tikzpicture}
\caption{Benchmarking data for timing from Insertionsort and Quicksort for
both random, sorted and reverse sorted data, plotted with a logarithmic y
axis}\label{fig:benchmarktime}
\end{figure}
From \autoref{fig:benchmarktime} we see that Insertionsort behaves as expected
with a best-case runtime of $\mathcal O(n)$ when the input is sorted, and a
worst-case runtime of $\mathcal O(n^2)$ when the input is reversely sorted or
random. We also see that Quicksort behaves as expected with a best-case runtime
of $\mathcal O(n\log n)$ when the input is random, and a worst-case runtime of
$\mathcal O(n^2)$ when the input is either sorted or reversely sorted.
It is, however, also possible to count the number of comparisons for each
algorithm. This plot can be seen in \autoref{fig:benchmarkcomp}. Here we
clearly see that, for example, Insertionsort performs exactly $n$ comparisons
when sorting an already sorted array of $n$ elements. We also see that
Quicksort with random elements matches $n\log n$ up to a constant factor.
\begin{figure}[H]
\centering
\begin{tikzpicture}
\begin{axis}[
xlabel={size({$n$})},
ylabel={comparisons},
ymode=log,
% xmode=log,
width=\textwidth,
legend pos=south east,
clip mode=individual,
mark size=1pt,
scatter/classes={
isort_random={mark=*,red},
isort_reverse={mark=*,magenta},
isort_sorted={mark=*,pink},
qsort_random={mark=*,teal},
qsort_reverse={mark=*,blue},
qsort_sorted={mark=*,cyan}
}
]
% -- Scatter plot --
\addplot[scatter, only marks, scatter src=explicit symbolic] table
[x=size, y=comp, meta=type, col sep=comma]
{../test/benchmark_comparisons.csv};
\legend{isort\_random, isort\_reverse, isort\_sorted, qsort\_random,
qsort\_reverse, qsort\_sorted}
% -- O(n)
\addplot[domain=1:100000, samples=100, ultra thick, densely dotted, black]
{(x)};
\addlegendentry{$n$}
% -- O(nlogn)
\addplot[domain=1:100000, samples=100, ultra thick, dotted, black]
{(x*ln(x))};
\addlegendentry{$n\log n$}
% -- O(n^2)
\addplot[domain=1:100000, samples=100, ultra thick, dashed, black]
{(x*x)};
\addlegendentry{$n^2$}
\end{axis}
\end{tikzpicture}
\caption{Benchmarking data for comparisons from Insertionsort and Quicksort
for both random, sorted and reverse sorted data. The y axis is logarithmic,
and the reference lines are not asymptotic but
exact.}\label{fig:benchmarkcomp}
\end{figure}
\subsection{Nonalgorithmic factors}
Although the asymptotic run times are determined by the algorithms themselves,
the total run times are also greatly affected by the concrete implementations
of highly repeated functions. A good example is our old generateOutput
function, which was extremely slow. That implementation looped over the sorted
list and used 4 write syscalls per coordinate: one for printing the
x-coordinate, one for printing the y-coordinate, one for a tab, and one for a
newline character. Thus 4 million syscalls would be needed just to print a
sorted list of 1 million coordinates, causing enormous overhead. The function
has since been replaced by a new approach that assembles the entire output in a
buffer first. The point is that even though the problem studied is efficient
sorting, correct usage of registers, memory and syscalls greatly affects
performance.
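\smallskip
\noindent
A minimal sketch of the buffered approach (AT\&T syntax; \texttt{out\_buf} and
\texttt{out\_len} are illustrative symbols, not the project's actual names):
\begin{minted}{asm}
# Flush the assembled output with a single write(2) syscall.
movq $1, %rax              # syscall number for write
movq $1, %rdi              # file descriptor 1: stdout
leaq out_buf(%rip), %rsi   # start of the output buffer
movq out_len(%rip), %rdx   # total number of bytes to print
syscall                    # one syscall instead of four per coordinate
\end{minted}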
\section{Conclusion}
In this project insertion sort and quicksort have been implemented in x86\_64
assembly with the goal of studying efficient sorting. Although the algorithms
perform the same task, they are in fact very different. Firstly, from the
benchmarks we have learned that the algorithms are heavily affected by the
dataset itself. Both algorithms have very different run times on the sorted,
reversed, and random datasets, despite the sets having the same size. As
discussed earlier, insertion sort and quicksort have roughly the same
performance on reversed datasets, insertion sort is optimal on sorted or small
datasets, and quicksort excels on random and very large datasets. So one could
say that efficient sorting requires an efficient algorithm well suited for the
dataset. It should also be mentioned that in most general cases with random
input, quicksort will most likely outperform insertion sort. Furthermore, from
the testing we have learned that the stability tests used on insertion sort
cannot be used on quicksort, since quicksort does not preserve the relative
order of equal inputs. This means that the algorithms might produce different
results even when run on the same dataset. Another important observation has
been that although the algorithms themselves determine the asymptotic run
times, the concrete implementations also matter for the total run times, which
means that memory usage, registers, stack management, and syscalls should also
be considered when trying to create a fast sorting program. Summing all these
observations up in one sentence, we can conclude that a recipe for efficient
sorting is choosing a well-suited algorithm supported by a smart
implementation.
\end{document}