\documentclass{article}
\usepackage{geometry}
\usepackage{hyperref}
\usepackage{pgfplots}
\usepackage{tikz}
\usepackage{minted}
\usepackage{booktabs}
\usepackage{siunitx}
\usepackage{algorithm2e}
\usepackage{float}
\usepackage{caption}
\usepackage{subcaption}
\usepackage{wrapfig}
\geometry{a4paper}
\DeclareCaptionFormat{custom}{\textbf{#1#2}\textit{\small #3}}
\captionsetup{format=custom, justification=centering}
% ----- Custom commands ------
\newcommand{\code}[1]{\small\mintinline[xleftmargin=2em, xrightmargin=2em, breaklines]{asm}{#1}}
\newcommand{\snippet}[3]{\inputminted[firstline=#1,lastline=#2,linenos, xleftmargin=1.5em, breaklines]{asm}{#3}}
\newcommand{\snippetC}[3]{\inputminted[firstline=#1,lastline=#2,linenos, xleftmargin=1.5em, breaklines]{c}{#3}}
% ----------------------------
\begin{document}
\begin{center}
\pagenumbering{gobble}
\Huge Sorting in X86-64 Linux Assembly \vspace{1cm}

\Large Navid S. \vspace{0.5cm}

Andreas K. L. \vspace{0.5cm}

Mikkel T. \vspace{1cm}

\large October 31$^\text{st}$ 2025
\end{center}
\newpage
\tableofcontents
\newpage
\pagenumbering{arabic}

\section{Introduction}
Sorting is one of the most fundamental operations in programming. Sorting algorithms play a crucial role in efficient searching, the organization and interpretation of data, and optimization problems. Furthermore, sorting is often the first step in the design of more complex algorithms in computer science. \smallskip

\noindent This project studies the efficient sorting of a list of coordinate pairs by their second coordinate, implemented entirely in \textit{x86-64 Linux Assembly}. Although high-level languages such as \textit{Python} and \textit{Java} allow faster and more convenient development, assembly provides precise control over program flow, memory usage, and instruction-level execution, making it an excellent environment for benchmarking and performance analysis. \smallskip

\noindent Two sorting algorithms, \textit{quicksort} and \textit{insertionsort}, will be implemented and compared against each other. Several test instances consisting of uniformly random coordinates will be created with the goal of comparing the implementations' actual runtimes to their theoretical runtimes and evaluating each implementation.

\section{Design}
\subsection{Program Flow}
The program is structured into four main \textit{phases}:

\noindent \textbf{selecting} an algorithm, \textbf{converting} the input file to the array format, \textbf{sorting} that array, and \textbf{writing} the array to \textit{stdout}. \smallskip

\noindent Each \textit{phase} is designed to handle a specific task, which creates modularity and clarity. Within each \textit{phase} a set of functions manages the corresponding operations, allowing the programmer to focus on one task at a time. This design also greatly supports \textit{debugging} by making errors easier to track down and fix. The flowchart in \autoref{fig:flowchart} visualizes this process.

\begin{figure}[H]
    \centering
    \includegraphics[width=\linewidth]{Design.drawio.png}
    \caption{Program flow}
    \label{fig:flowchart}
\end{figure}

\noindent \textbf{Parsing the data:} After reading the input, the received file is \textit{parsed}, meaning that only useful data is extracted. This in turn significantly simplifies \textit{iteration} through the data by ensuring that only meaningful data is left. \smallskip

\noindent Lastly, an important design choice has been to create an array of \textit{pointers} to each coordinate pair.
This design is explained in further detail later. \medskip

\noindent \textbf{Sorting:} After parsing, a sorting algorithm of choice can be applied to the coordinates to sort by a preferred coordinate. The chosen algorithm can be passed as a \textit{command line argument}; otherwise a default algorithm is run. \smallskip

\noindent This design, where an algorithm can be passed to the sorting function, allows the program to be extended with new algorithms if needed. \smallskip

\noindent \textbf{Generating output:} The last step of the program is converting the parsed data back into a format that can be written to \textit{stdout}.

\subsection{Input and Output format}
Each line contains one pair of coordinates consisting of two coordinate values formatted as \texttt{$\langle$x$\rangle$\textbackslash t$\langle$y$\rangle$\textbackslash n}, where \texttt{x} and \texttt{y} are non-negative integers in the range 0 to 32767. \smallskip

\noindent Although limiting, this range is a design decision taken in order to avoid unnecessary complexity and keep the focus on the main objective of the project: efficient sorting. Including negative numbers would, for example, require extra logic for parsing, comparison, conversion to and from ASCII, and memory allocation. \smallskip

\subsection{Selecting algorithms}
The ability to choose which algorithm to use can be quite beneficial if you know certain properties of your data. Therefore, a design choice has been to let the user select an algorithm as a command line argument rather than hard-coding it. \smallskip

\noindent This design decision provides flexibility and also supports future extensions of the program.

\subsection{Why Insertionsort and Quicksort?}
Insertionsort and quicksort have been handpicked to study the trade-offs between algorithmic simplicity and computational efficiency. Furthermore, these algorithms are fundamentally different in nature. \smallskip

\noindent Insertionsort is an iterative algorithm, depending primarily on memory usage, registers, and program flow, while quicksort is a recursive algorithm using the \textit{divide and conquer} approach. This contrast is interesting because quicksort is expected to excel on large random datasets, while insertionsort is more suitable for small or nearly sorted inputs. This is further motivation to understand what really happens when these algorithms are executed, and how the dataset affects them.
\begin{wrapfigure}[15]{r}{0.5\textwidth}
    \centering
    \begin{tikzpicture}
        \draw (0,0) rectangle (4,1); % outer box
        \node at (-0.5,0.5) {$A=$};
        % vertical dividers
        \draw (1,0) -- (1,1); \draw (2,0) -- (2,1); \draw (3,0) -- (3,1);
        \node at (0.5,0.5) {$p_1$}; \node at (1.5,0.5) {$p_2$}; \node at (2.5,0.5) {$\cdots$}; \node at (3.5,0.5) {$p_n$};
        \draw (0.1,2) rectangle (0.8,5);
        \node at (0.5,2.5) {$e_{11}$}; \node at (0.45,3.6) {$\vdots$}; \node at (0.5,4.5) {$e_{1m}$};
        \draw (1.1,2) rectangle (1.8,5);
        \node at (1.5,2.5) {$e_{21}$}; \node at (1.45,3.6) {$\vdots$}; \node at (1.5,4.5) {$e_{2m}$};
        \draw (3.1,2) rectangle (3.8,5);
        \node at (3.5,2.5) {$e_{n1}$}; \node at (3.45,3.6) {$\vdots$}; \node at (3.5,4.5) {$e_{nm}$};
        \draw[->] (0.5,1.1) -- (0.5,1.9); \draw[->] (1.5,1.1) -- (1.5,1.9); \draw[->] (3.5,1.1) -- (3.5,1.9);
    \end{tikzpicture}
    \caption{The format of the array $A$, where $p_n$ is the $n^\text{th}$ memory address, pointing to an array where $e_{nm}$ is the $m^\text{th}$ element of the $n^\text{th}$ array in $A$.}
    \label{fig:array}
\end{wrapfigure}

\subsection{Sorting Format}
To be more general and make our program easily extendable to different types of data, we wanted to create our own array datatype that allows us to sort abstract objects with respect to a key within each object. \smallskip

\noindent To do this we settled on an array of memory addresses that point to the abstract objects; see \autoref{fig:array}. \smallskip

\noindent In practice these objects are themselves represented as arrays. With this layout we can specify which index (key) in the sub-arrays to sort by, and when we swap elements in the array, we simply swap the memory addresses.

\section{Implementation}
This four-phase design has been implemented in \textit{Assembly} code using the \textit{x86-64} instruction set. Furthermore, the design has been split into modules, where each module is a collection of closely related functions.

\noindent For example, \texttt{array\_maker.s} contains functions meant to parse data and convert it to the array format. All functions in this program follow the \textit{x86-64 calling conventions}, which makes adding more features a lot easier. \smallskip

\noindent Below is a detailed explanation of some of the more interesting functions.

\begin{wrapfigure}{r}{0.6\textwidth}
    \textbf{Input\ \ :} File descriptor $f$\\
    \textbf{Output:} Pointer and size of array $A$\\
    \\
    \texttt{// Pointer to file data in $p$ and}\\
    \texttt{// length of that data in $n$.}\\
    $p,n\leftarrow$ create\_file\_buffer($f$)\\
    \\
    \texttt{// Pointer to pairwise data in $p$}\\
    \texttt{// and number of pairs in $n$}\\
    $p,n\leftarrow$ make\_pairwise\_data($p,n$)\\
    \\
    \texttt{// Allocate space for final array,}\\
    \texttt{// and save pointer in $A$}\\
    $A\leftarrow$ allocate($8n$)\\
    \\
    \texttt{// Put memory address of pair $i$}\\
    \texttt{// at index $i$ in $A$}\\
    \textbf{for } $i\leftarrow 0$\textbf{ to }$n-1$\textbf{ do}\\
    \hspace*{2em}$A[i]\leftarrow$ mem\_addr($p[2i]$)\\
    \textbf{end}\\
    \textbf{return} $A$ and $n$
    \caption{Pseudocode for \texttt{array\_maker}}\label{fig:arraymaker}
\end{wrapfigure}

\subsection{Array maker}
We implemented a global function \texttt{array\_maker} that takes the \textit{file descriptor} as input and returns a pointer to the start of the array together with the number of elements.
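\smallskip \noindent As a minimal sketch of how a caller interacts with this interface (the choice of \texttt{rax} for the pointer and \texttt{rdx} for the count mirrors the common two-value return convention and is illustrative rather than a quote of our exact code):

\begin{minted}[xleftmargin=1.5em, breaklines]{asm}
        # File descriptor already in %rdi (first argument register).
        call    array_maker          # assumed: %rax = A, %rdx = n
        movq    %rax, %r12           # stash A (callee-saved; a full routine saves %r12 first)
        movq    %rdx, %r13           # stash n likewise

        # Reaching the y-coordinate (key index 1) of pair i, with i in %rcx:
        movq    (%r12,%rcx,8), %rsi  # %rsi = A[i], the address of pair i
        movq    8(%rsi), %rax        # %rax = second 64-bit field, the y-value
\end{minted}

\noindent This double dereference is exactly what the sorting algorithms below perform when comparing two keys, while swapping two pairs only swaps the 8-byte pointers inside $A$.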
\smallskip \noindent The first thing the function does is call another function, \texttt{create\_file\_buffer}, which reads the contents of the file and puts it in memory. It then passes this address and the length of the file on to the function \texttt{make\_pairwise\_data}. \smallskip

\noindent This function removes \texttt{\textbackslash t} and \texttt{\textbackslash n}, and puts the integers side by side, each with space for a 64-bit integer. This format is convenient because it allows us to iterate over the number of coordinates and select every pair $i$ by going to position $2i$, using that as the place in memory that the $i^\text{th}$ pointer in the array points to. \smallskip

\noindent The pseudocode for this can be seen in \autoref{fig:arraymaker}.\footnote{The assembly code is quite easy to recreate from this pseudocode, hence the pseudocode instead of assembly. This also makes it easier to convey its functionality.}

\newpage
\begin{wrapfigure}[22]{r}{0.6\textwidth}
    \snippetC{1}{22}{qsort.c}
    \caption{Pseudocode for the quicksort algorithm in C style}\label{fig:qsort}
\end{wrapfigure}

\subsection{Quicksort}
The quicksort implementation uses the \textbf{Hoare partition scheme}, which is a \textit{two-way partitioning} approach, where two pointers scan from opposite ends of the array toward the middle. \smallskip

\noindent The left pointer advances rightward while pointing at elements smaller than the \textit{pivot}, and the right pointer advances leftward while pointing at elements larger than the \textit{pivot}. When both pointers point to misplaced elements, they swap them. \smallskip

\noindent This continues until the pointers cross, effectively partitioning the array into two segments. \medskip

\noindent In order to compare two values in the array, the algorithm \textit{dereferences} the sub-arrays and accesses the elements at the given \textit{index}. \smallskip

\noindent To swap two \textit{sub-arrays}, it simply swaps their \textit{pointers} in the pointer array.

\newpage
\subsection{Integer to ASCII}
\begin{wrapfigure}[11]{r}{0.4\textwidth}
    \begin{tabular}{|c|c|c|c|c|}
        \hline
        \multicolumn{5}{|c|}{ASCII table :: Numeric Glyphs} \\
        \hline
        0 & 1 & 2 & 3 & 4 \\
        \hline
        $48_{10}$ & $49_{10}$ & $50_{10}$ & $51_{10}$ & $52_{10}$ \\
        $30_{16}$ & $31_{16}$ & $32_{16}$ & $33_{16}$ & $34_{16}$ \\
        \hline
        5 & 6 & 7 & 8 & 9 \\
        \hline
        $53_{10}$ & $54_{10}$ & $55_{10}$ & $56_{10}$ & $57_{10}$ \\
        $35_{16}$ & $36_{16}$ & $37_{16}$ & $38_{16}$ & $39_{16}$ \\
        \hline
    \end{tabular}
    \caption{Table of numeric ASCII codes}\label{fig:ascii-table}
\end{wrapfigure}
When converting each coordinate back into \textit{ASCII}, we use repeated \textbf{division with remainder} to extract each digit, but the result is still in binary. \smallskip

\noindent To convert a binary digit to its \textit{ASCII} character, we simply add the \textit{ASCII} code for \texttt{0}, which is \texttt{48} in decimal and \texttt{30} in hexadecimal. This works because the digit characters appear in consecutive order in the \textit{ASCII table}. \smallskip

\noindent The downside of this method is that it produces the digits in \textit{reverse} order, which means we need to start at the right side of the \textit{buffer} when adding new digits, so as not to \textit{overwrite} previous digits. \smallskip

\noindent Afterwards, the digits need to be shifted down to the \textit{leftmost} position in memory. \bigskip

\noindent To do this we used \texttt{REP MOVSB}, a string operation in the \textit{x86 instruction set architecture} that performs a \textit{repeated byte copy operation}.
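\smallskip \noindent The sketch below combines the two steps: extracting digits right-to-left by repeated division, then shifting them to the start of the buffer with \texttt{REP MOVSB}, whose operands are summarized in \autoref{fig:movsb-operands} below. The buffer name, its size, and the register assignments are illustrative placeholders, not taken from our actual routine.

\begin{minted}[xleftmargin=1.5em, breaklines]{asm}
        .bss
buf:    .skip   16                   # scratch buffer (illustrative size)
        .equ    BUFLEN, 16

        .text
        # Assumes the unsigned value to convert is in %rax.
int_to_ascii:
        leaq    buf+BUFLEN, %rdi     # one past the buffer's last byte
        movq    $10, %r8             # divisor
        xorq    %rcx, %rcx           # digit counter
next_digit:
        xorq    %rdx, %rdx           # divq divides %rdx:%rax, so clear %rdx
        divq    %r8                  # %rax = quotient, %rdx = remainder
        addb    $48, %dl             # remainder 0-9 plus the ASCII code for '0'
        decq    %rdi
        movb    %dl, (%rdi)          # store digits right-to-left
        incq    %rcx
        testq   %rax, %rax
        jnz     next_digit           # loop until the quotient is zero
        # Shift the %rcx digits down to the start of the buffer.
        movq    %rdi, %rsi           # source: leftmost (first) digit
        leaq    buf, %rdi            # destination: buffer start
        cld                          # DF = 0, so both pointers increment
        rep movsb                    # copy %rcx bytes forward
        ret
\end{minted}

\noindent The forward copy is safe even though the regions may overlap, since the destination lies below the source; afterwards \texttt{RDI} conveniently points just past the last digit, where a tab or newline can be appended.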
\noindent
\begin{wrapfigure}[8]{l}{0.55\textwidth}
    \begin{tabular}{|c|c|}
        \hline
        \multicolumn{2}{|c|}{\texttt{REP MOVSB} :: Operands Overview} \\
        \hline
        Register/Flag & Purpose \\
        \hline
        \texttt{RSI} & Source memory address \\
        \texttt{RDI} & Destination memory address \\
        \texttt{RCX} & Repeat count \\
        \texttt{DF} & Direction \\
        \hline
    \end{tabular}
    \caption{Table of \texttt{REP MOVSB} operands}\label{fig:movsb-operands}
\end{wrapfigure}
\noindent It copies data from the memory location pointed to by the \textit{source} index register (\texttt{RSI}) to the location pointed to by the \textit{destination} index register (\texttt{RDI}), repeating this operation for the number of \textit{iterations} specified by the count register (\texttt{RCX}). \vspace{1cm}

\noindent \texttt{RSI} and \texttt{RDI} are automatically either \textit{incremented} or \textit{decremented}, based on the direction flag (\texttt{DF}); when the direction flag is \textit{clear}, both pointers \textit{increment}, and when it is \textit{set}, both pointers \textit{decrement}. \smallskip

\noindent This \textit{instruction} is particularly useful for bulk memory operations such as copying strings, as it encodes a potentially lengthy loop into a \textit{single} instruction, thereby improving code density.

\section{Evaluation}
To evaluate whether our program works, we created three \textit{shell scripts}: one for generating test data (\textit{generate\_test\_data.sh}), one for testing the validity of the program's output (\textit{test.sh}), and one for measuring the execution time of the program (\textit{benchmark.sh}). \smallskip

\noindent The script that generates test files creates files with the sizes $\{1000n\mid 0\leq n\leq 100\}$, where the size is the number of coordinates. For each size we also create three different kinds of test files: one where all of the data is random, one where the data is already sorted, and one where it is reverse sorted.\footnote{We also ran the benchmark for $n=1$ million random coordinates, but omitted these values to keep the plot from becoming too sparse. Quicksort ran for 0.159s and insertionsort for 386.935s, making quicksort roughly 2400 times faster than insertionsort.} \smallskip

\noindent This way we can see both the average case and the worst case of any algorithm we decide to use. Now we can use our \textit{test.sh} script, which runs the program with each of the generated test files for all of the program's sorting algorithms and outputs the result to a temporary file. It then uses the Python script \textit{check.py}, which checks whether its second argument is a valid sorting of its first argument.

\subsection{Benchmarking}
Given that all of the tests pass, we can then run \textit{benchmark.sh}, which times how long our program takes to run with each test file for each algorithm, and saves the results in a \textit{csv} file. The contents of this \textit{csv} file, together with functions asymptotically equivalent to $\mathcal O(n)$, $\mathcal O(n\log n)$, and $\mathcal O(n^2)$, are shown in \autoref{fig:benchmarktime}.
\begin{figure}[H]
    \centering
    \begin{tikzpicture}
        \begin{axis}[
            xlabel={size($n$)},
            ylabel={time($s$)},
            ymode=log,
            % xmode=log,
            width=\textwidth,
            legend pos=north west,
            clip mode=individual,
            mark size=1pt,
            scatter/classes={
                isort_random={mark=*,red},
                isort_reverse={mark=*,magenta},
                isort_sorted={mark=*,pink},
                qsort_random={mark=*,teal},
                qsort_reverse={mark=*,blue},
                qsort_sorted={mark=*,cyan}
            }
        ]
        % -- Scatter plot --
        \addplot[scatter, only marks, scatter src=explicit symbolic]
            table [x=size, y=time, meta=type, col sep=comma] {../test/benchmark_results.csv};
        \legend{isort\_random, isort\_reverse, isort\_sorted, qsort\_random, qsort\_reverse, qsort\_sorted}
        % -- O(n)
        \addplot[domain=1:100000, samples=100, ultra thick, densely dotted, black] {(x * 1/18000000+0.0003)};
        \addlegendentry{$\mathcal O(n)$}
        % -- O(nlogn)
        \addplot[domain=1:100000, samples=100, ultra thick, dotted, black] {(x*ln(x)) * 1/80000000 + 0.0003};
        \addlegendentry{$\mathcal O(n\log n)$}
        % -- O(n^2)
        \addplot[domain=1:100000, samples=100, ultra thick, dashed, black] {(x*x) * 1/5000000000+0.0003};
        \addlegendentry{$\mathcal O(n^2)$}
        \end{axis}
    \end{tikzpicture}
    \caption{Benchmarking data for timing from insertionsort and quicksort for random, sorted, and reverse sorted data, plotted with a logarithmic y axis}\label{fig:benchmarktime}
\end{figure}

\smallskip \noindent From \autoref{fig:benchmarktime} we see that insertionsort behaves as expected, with a best-case runtime of $\mathcal O(n)$ when the input is sorted and a worst-case runtime of $\mathcal O(n^2)$ when the input is reverse sorted or random. We also see that quicksort behaves as expected, with an average-case runtime of $\mathcal O(n\log n)$ when the input is random and a worst-case runtime of $\mathcal O(n^2)$ when the input is either sorted or reverse sorted. \smallskip

\noindent It is, however, also possible to count the number of comparisons for each algorithm. This plot can be seen in \autoref{fig:benchmarkcomp}. Here we clearly see that, for example, insertionsort performs exactly $n$ comparisons when sorting an already sorted array of $n$ elements. We also see that quicksort on random elements matches $n\log n$ up to a constant factor.

\begin{figure}[H]
    \centering
    \begin{tikzpicture}
        \begin{axis}[
            xlabel={size($n$)},
            ylabel={comparisons},
            ymode=log,
            % xmode=log,
            width=\textwidth,
            legend pos=south east,
            clip mode=individual,
            mark size=1pt,
            scatter/classes={
                isort_random={mark=*,red},
                isort_reverse={mark=*,magenta},
                isort_sorted={mark=*,pink},
                qsort_random={mark=*,teal},
                qsort_reverse={mark=*,blue},
                qsort_sorted={mark=*,cyan}
            }
        ]
        % -- Scatter plot --
        \addplot[scatter, only marks, scatter src=explicit symbolic]
            table [x=size, y=comp, meta=type, col sep=comma] {../test/benchmark_comparisons.csv};
        \legend{isort\_random, isort\_reverse, isort\_sorted, qsort\_random, qsort\_reverse, qsort\_sorted}
        % -- O(n)
        \addplot[domain=1:100000, samples=100, ultra thick, densely dotted, black] {(x)};
        \addlegendentry{$n$}
        % -- O(nlogn)
        \addplot[domain=1:100000, samples=100, ultra thick, dotted, black] {(x*ln(x))};
        \addlegendentry{$n\log n$}
        % -- O(n^2)
        \addplot[domain=1:100000, samples=100, ultra thick, dashed, black] {(x*x)};
        \addlegendentry{$n^2$}
        \end{axis}
    \end{tikzpicture}
    \caption{Benchmarking data for comparisons from insertionsort and quicksort for random, sorted, and reverse sorted data.
    The y axis is logarithmic, and the reference lines are not asymptotic but exact.}\label{fig:benchmarkcomp}
\end{figure}

\subsection{Non-algorithmic factors}
Although the asymptotic run times are caused by the algorithms themselves, the total run times are also greatly affected by the concrete implementations of highly repeated functions. \smallskip

\noindent A good example is our old \texttt{generateOutput} function, which was extremely slow. It looped over the sorted list and used four \texttt{write} \textit{syscalls} per coordinate pair: one for printing the $x$-coordinate, one for printing the $y$-coordinate, one for the tab, and one for the newline character. Thus 4 million syscalls would be needed just to print a sorted list of 1 million coordinate pairs, which causes an enormous overhead. This function has since been removed and replaced by a new approach using a buffer. \smallskip

\noindent The point is that even though the problem studied is efficient sorting, the correct usage of registers, memory, and syscalls will greatly affect and boost performance.

\section{Conclusion}
In this project insertionsort and quicksort have been implemented in \textit{x86-64 assembly} with the goal of studying efficient sorting. Although the algorithms perform the same task, they are in fact very different. \smallskip

\noindent Firstly, from the benchmarks we have learned that the algorithms are heavily affected by the dataset itself. Both algorithms have very different run times on the sorted, reversed, and random datasets, despite the sets having the same size. \smallskip

\noindent As discussed earlier, insertionsort and quicksort have roughly the same performance on reversed datasets, insertionsort is optimal on sorted or small datasets, and quicksort excels on random and very large datasets. Thus efficient sorting requires an efficient algorithm well-suited to the dataset. It should also be mentioned, however, that in most general cases with random input quicksort will outperform insertionsort. \smallskip

\noindent Another important observation has been that although the algorithms themselves determine the asymptotic run times, the concrete implementations also matter for the total run times. This means that memory usage, registers, stack management, and syscalls should also be considered when trying to create a fast sorting program. \smallskip

\noindent Summing all of these observations up in one sentence, we can conclude that the recipe for efficient sorting is choosing a well-suited algorithm supported by a smart implementation.
\end{document}