\documentclass{article}
\usepackage{geometry}
\usepackage{graphicx}
\usepackage{hyperref}
\usepackage{pgfplots}
\usepackage{tikz}
\usepackage{minted}
\usepackage{booktabs}
\usepackage{siunitx}
\usepackage{algorithm2e}
\usepackage{float}
\usepackage{caption}
\usepackage{subcaption}
\geometry{a4paper}
% ----- Custom commands ------
\newcommand{\code}[1]{\small\mintinline[xleftmargin=2em, xrightmargin=2em, breaklines]{asm}{#1}}
\newcommand{\snippet}[3]{\inputminted[firstline=#1,lastline=#2,linenos, xleftmargin=1.5em, breaklines]{asm}{#3}}
% ----------------------------
\begin{document}
\begin{center}
\Huge Sorting in x86\_64 Linux Assembly
\vspace{1cm}

\Large Navid S.
\vspace{0.5cm}

Andreas K. L.
\vspace{0.5cm}

Mikkel T.
\vspace{1cm}

\large October 31$^\text{st}$, 2025
\end{center}
\newpage
\tableofcontents
\newpage

\section{Introduction}
Sorting is one of the most fundamental operations in programming. Sorting algorithms play a crucial role in efficient searching, organization and interpretation of data, and optimization problems. Furthermore, sorting is often the first step in designing more complex algorithms in computer science.

This project studies efficient sorting of a list of coordinates by their second coordinate, implemented solely in \textit{x86\_64 Linux Assembly}. Although high-level languages such as \textit{Python} and \textit{Java} allow faster and more convenient development, Assembly provides precise control over program flow, memory usage and instruction-level execution, making it an excellent environment for benchmarking and performance analysis.

Two sorting algorithms, quicksort and insertion sort, will be implemented and compared against each other. Several test instances consisting of uniformly random coordinates will be created, with the goal of comparing the implementations' actual runtimes to their theoretical runtimes and evaluating each implementation.

\section{Design}
\subsection{Program Flow}
The program is structured into four main phases: selecting an algorithm, converting the input file to the array format, sorting that array, and printing the array to stdout. Each phase handles a specific task, which creates modularity and clarity. Within each phase a set of functions manages the corresponding operations, allowing the programmer to focus on one task at a time. This design also greatly simplifies debugging, since errors can be tracked down one phase at a time. Below is a flowchart visualizing each step, followed by a short summary of the design.

\begin{figure}[H]
\centering
\includegraphics[width=\linewidth]{Design.drawio.png}
\caption{Program flow}
\label{fig:flowchart}
\end{figure}

\textbf{Parsing the data:} After reading the input, the file is parsed, meaning that only useful data is extracted. This in turn significantly simplifies iterating through the data, by ensuring that only meaningful values are left. Lastly, an important design choice has been to create an array of pointers to the coordinate pairs. This design is explained in further detail later.

\textbf{Sorting:} After parsing, a sorting algorithm of choice can be applied to the coordinates, sorting by a preferred coordinate. The chosen algorithm can be passed as a command line argument; otherwise a default algorithm is run. This design, where an algorithm can be passed to the sorting function, allows the program to be extended with new algorithms if needed, as sketched below.
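As an illustration, the dispatch could look like the following minimal sketch. The helper name \texttt{select\_algorithm}, the routine labels \texttt{quicksort} and \texttt{insertionsort}, the argument name \texttt{isort} and the register usage are hypothetical, not the project's actual code (AT\&T syntax):

\begin{minted}[xleftmargin=1.5em, breaklines]{asm}
# Hypothetical sketch: map the algorithm name in argv[1] to a routine.
# %rdi = pointer to the NUL-terminated name, or 0 when no argument given.
# Returns the address of the chosen routine in %rax.
select_algorithm:
        leaq    quicksort(%rip), %rax      # default algorithm
        testq   %rdi, %rdi
        jz      2f                         # no argument: keep the default
        leaq    isort_name(%rip), %rsi     # compare argv[1] with "isort"
1:      movb    (%rdi), %cl
        cmpb    (%rsi), %cl
        jne     2f                         # mismatch: keep the default
        incq    %rdi
        incq    %rsi
        testb   %cl, %cl                   # reached the NUL terminator?
        jnz     1b
        leaq    insertionsort(%rip), %rax  # name matched: insertion sort
2:      ret

isort_name:
        .asciz "isort"
\end{minted}

New algorithms can then be added by extending the comparison chain with another name and label, without touching the sorting phase itself.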
\textbf{Generating output:} The last step of the program is converting the parsed data back into a format that can be printed.

\subsection{Input and Output format}
Each line contains exactly one coordinate, consisting of two values formatted as \texttt{$\langle$x$\rangle$\textbackslash t$\langle$y$\rangle$\textbackslash n}, where \texttt{x} and \texttt{y} are non-negative integers in the range 0 to 32767. For example, the line \texttt{12\textbackslash t7\textbackslash n} encodes the coordinate $(12, 7)$. Although limiting, this range is a deliberate design decision made to avoid unnecessary complexity and keep focus on the main objective of the project: efficient sorting. Supporting negative numbers would, for example, require extra logic for parsing, comparison, conversion back and forth to ASCII, and memory allocation. Both input and output follow this strict format.

\subsection{Selecting algorithms}
The ability to choose which algorithm to use can be quite beneficial if certain properties of the data are known in advance. For example, \textit{quicksort} performs very well on random data but poorly on sorted data, while \textit{insertion sort} performs well on sorted data but poorly on random data. For that reason, and because it makes testing different algorithms easier and more flexible, the program can be run with different algorithms passed as arguments.

\subsection{Why Insertion Sort and Quicksort?}
Insertion sort and quicksort were handpicked to study the trade-off between algorithmic simplicity and computational efficiency. Furthermore, the two algorithms are very different in nature: insertion sort is an iterative algorithm, depending primarily on memory usage, registers, and program flow, while quicksort is a recursive algorithm following the divide-and-conquer approach. This is interesting because quicksort is expected to excel on large random datasets, while insertion sort is more suitable for small or nearly sorted inputs, which is further motivation to understand what really happens when these algorithms execute, and how the dataset affects them.

\subsection{Sorting Format}
To stay general and make the program easily extendable to different types of data, we created our own array datatype that allows sorting something more abstract, with respect to a key in that abstract object. To do this we settled on an array of memory addresses that point to the abstract objects; see \autoref{fig:array}. In practice these objects are themselves represented as arrays. With this we can specify which index (key) in the sub-arrays to sort by, and when we swap elements in the array, we simply swap the memory addresses around.
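Concretely, a swap only exchanges two 8-byte addresses; the coordinate pairs themselves never move. A minimal sketch, with illustrative register assignments (AT\&T syntax):

\begin{minted}[xleftmargin=1.5em, breaklines]{asm}
# Sketch: swap A[i] and A[j] in the pointer array.
# %rdi = base address of A, %rsi = i, %rcx = j (illustrative choices)
movq    (%rdi,%rsi,8), %rax    # rax = A[i]
movq    (%rdi,%rcx,8), %r8     # r8  = A[j]
movq    %r8, (%rdi,%rsi,8)     # A[i] = old A[j]
movq    %rax, (%rdi,%rcx,8)    # A[j] = old A[i]
\end{minted}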
\begin{figure}[H]
\centering
\begin{tikzpicture}
\draw (0,0) rectangle (4,1); % outer box
\node at (-0.5,0.5) {$A=$};
% vertical dividers
\draw (1,0) -- (1,1);
\draw (2,0) -- (2,1);
\draw (3,0) -- (3,1);
\node at (0.5,0.5) {$p_1$};
\node at (1.5,0.5) {$p_2$};
\node at (2.5,0.5) {$\cdots$};
\node at (3.5,0.5) {$p_n$};
\draw (0.1,2) rectangle (0.8,5);
\node at (0.5,2.5) {$e_{11}$};
\node at (0.45,3.6) {$\vdots$};
\node at (0.5,4.5) {$e_{1m}$};
\draw (1.1,2) rectangle (1.8,5);
\node at (1.5,2.5) {$e_{21}$};
\node at (1.45,3.6) {$\vdots$};
\node at (1.5,4.5) {$e_{2m}$};
\draw (3.1,2) rectangle (3.8,5);
\node at (3.5,2.5) {$e_{n1}$};
\node at (3.45,3.6) {$\vdots$};
\node at (3.5,4.5) {$e_{nm}$};
\draw[->] (0.5,1.1) -- (0.5,1.9);
\draw[->] (1.5,1.1) -- (1.5,1.9);
\draw[->] (3.5,1.1) -- (3.5,1.9);
\end{tikzpicture}
\caption{The format of the array $A$, where $p_n$ is the $n^\text{th}$ memory address, pointing to an array in which $e_{nm}$ is the $m^\text{th}$ element of the $n^\text{th}$ array in $A$.}
\label{fig:array}
\end{figure}

\section{Implementation}
This four-phase design has been translated into Assembly code using the x86\_64 instruction set. Furthermore, the design has been split into modules, where each module is a collection of closely related functions. For example, \textit{array\_maker.s} contains the functions that parse data and convert it to the array format. All functions in this program follow the standard x86\_64 calling convention, with a few exceptions where all registers are preserved. Below is a detailed explanation of some of the more interesting functions.

\subsection{Array maker}
We implemented a global function \texttt{array\_maker} that takes a \textit{file descriptor} as input and returns a pointer to the start of the array together with the number of elements. It first calls the function \texttt{create\_file\_buffer}, which reads the contents of the file into memory. It then passes this address and the length of the file on to the function \texttt{make\_pairwise\_data}.

\begin{figure}[H]
\snippet{19}{46}{../src/array_maker.s}
\caption{Code snippet from \texttt{array\_maker.s}}\label{fig:arraysnippet}
\end{figure}

In summary, \texttt{array\_maker} performs the following steps:
\begin{enumerate}
    \item Load the file into memory.
    \item Create the pairwise data.
    \item Allocate space for the array.
    \item Go through each element of the array and create a pointer that points to a pair in the pairwise data.
\end{enumerate}

\subsection{Insertion sort}
This is a classic implementation of the iterative insertion sort. The function takes three arguments: the address of the array of pointers to the coordinates, the number of coordinates, and the coordinate dimension to sort by. The first step of the algorithm is to check the number of coordinates with a compare instruction: if $n \le 1$, the array is already sorted and nothing needs to be done. Otherwise the algorithm initializes two indices $i$ and $j$, representing the outer and inner loops of insertion sort. A snippet of the inner loop is shown below.
\begin{algorithm}[H]
\caption{Insertion sort inner loop}
\SetAlgoNoLine
\SetKw{KwCmp}{cmpq}
\SetKw{KwJl}{jl}
\SetKw{KwMov}{movq}
\SetKw{KwJge}{jge}
\SetKw{KwDec}{decq}
\SetKw{KwJmp}{jmp}
\textbf{inner\_loop:} \\
\KwCmp{$0$, \%r14} \tcp*[r]{if $j < 0$ stop and insert the key}
\KwJl insert\_key \\
\KwMov{$(\%rdi, \%r14, 8)$, \%r10} \tcp*[r]{pointer to $A[j]$ → r10}
\KwMov{$(\%r10, \%rdx, 8)$, \%r11} \tcp*[r]{sort key of $A[j]$ → r11}
\KwCmp{\%r11, \%r9} \\
\KwJge insert\_key \tcp*[r]{if $A[i] \ge A[j]$ stop}
\KwMov{\%r10, $8(\%rdi, \%r14, 8)$} \tcp*[r]{$A[j+1] = A[j]$, shift the pointer right}
\KwDec{\%r14} \tcp*[r]{move to $j - 1$}
\KwJmp inner\_loop \\
\textbf{insert\_key:} \tcp*[r]{Insert the key here}
\end{algorithm}

The registers R8 and R9 hold the current index and key value of $i$, while R10 and R11 are used for $j$. Loading the value of an element consists of two steps: first the pointer to the element is loaded, and then the actual value behind that pointer. This is a consequence of implementing an array of pointers for flexibility, instead of using the coordinate array directly. The two \texttt{movq} instructions in the pseudocode demonstrate this process. Each pointer is 8 bytes, so the correct pointer is calculated by adding $\mathit{index} \cdot 8$ to RDI, which stores the head of the pointer array. Next, because each value is also 8 bytes, adding $\mathrm{RDX} \cdot 8$ to that pointer, where RDX is the coordinate dimension to sort by, retrieves the actual value.

After loading the values, compare instructions can be used to find the correct placement of each coordinate. Like the standard insertion sort, this implementation maintains a sorted part on the left side of the array. Therefore the first compare checks whether $A[i]$ is greater than or equal to $A[j]$; if so, $A[i]$ cannot be smaller than any element at position $j$ or before. Otherwise the algorithm keeps searching for the correct placement of $A[i]$ in the sorted part by decreasing $j$ and repeating the inner loop. If $j$ becomes negative, the element $A[i]$ is the smallest coordinate seen so far and is inserted at the front. This process continues until the outer index $i$ reaches the number of coordinates.

\section{Evaluation}
To evaluate whether our program works, we created three \textit{shell scripts}: one for generating test data (\textit{generate\_test\_data.sh}), one for testing the validity of the program's output (\textit{test.sh}), and one for measuring the execution time of the program (\textit{benchmark.sh}).

The generation script produces files of sizes 0, 5000, 10000, 50000, 100000, 500000 and 1000000, where a size of 10 means the file consists of 10 coordinates. For each $n$ we also create three different kinds of test files: one where all of the data is random, one where the data is already sorted, and one where it is reverse sorted.\footnote{We also ran the benchmark for $n = 1000000$ with random data, but omitted these values from the plot to keep it from becoming too sparse. Quicksort ran for \SI{0.159}{\second} and insertion sort for \SI{386.935}{\second}, making quicksort roughly 2400 times faster.} This way we can observe both the average case and the worst case of any algorithm we decide to use.

Now we can use our \textit{test.sh} script, which runs the program with each of the generated test files for all sorting algorithms in the program, and compares the output with the output of the built-in \textit{sort} utility.
Once all of the tests pass, we can run \textit{benchmark.sh}, which times how long our program takes to run with each test file for each algorithm, and saves the results in a \textit{CSV} file. The contents of this CSV file, together with reference functions asymptotically equivalent to $\mathcal O(n)$, $\mathcal O(n\log n)$ and $\mathcal O(n^2)$, are shown in \autoref{fig:benchmark}.

\begin{figure}[H]
\centering
\begin{tikzpicture}
\begin{axis}[
    xlabel={size({$n$})},
    ylabel={seconds},
    ymode=log,
    % xmode=log,
    width=\textwidth,
    legend pos=north west,
    clip mode=individual,
    scatter/classes={
        isort_random={mark=*,red},
        isort_reverse={mark=*,green},
        isort_sorted={mark=*,blue},
        qsort_random={mark=*,cyan},
        qsort_reverse={mark=*,magenta},
        qsort_sorted={mark=*,yellow}
    }
]
% -- Scatter plot --
\addplot[scatter, only marks, scatter src=explicit symbolic]
    table [x=size, y=time, meta=type, col sep=comma] {../test/benchmark_results.csv};
\legend{isort\_random, isort\_reverse, isort\_sorted, qsort\_random, qsort\_reverse, qsort\_sorted}
% -- O(n)
\addplot[domain=1:100000, samples=100, ultra thick, densely dotted, black] {(x * 1/18000000+0.0003)};
\addlegendentry{$\mathcal O(n)$}
% -- O(nlogn)
\addplot[domain=1:100000, samples=100, ultra thick, dotted, black] {(x*ln(x)) * 1/80000000 + 0.0003};
\addlegendentry{$\mathcal O(n\log n)$}
% -- O(n^2)
\addplot[domain=1:100000, samples=100, ultra thick, dashed, black] {(x*x) * 1/5000000000+0.0003};
\addlegendentry{$\mathcal O(n^2)$}
\end{axis}
\end{tikzpicture}
\caption{Benchmarking data for insertion sort and quicksort on random, sorted, and reverse sorted data.}\label{fig:benchmark}
\end{figure}

From \autoref{fig:benchmark} we see that insertion sort behaves as expected, with a best-case runtime of $\mathcal O(n)$ when the input is sorted and an $\mathcal O(n^2)$ runtime when the input is reverse sorted or random. We also see that quicksort behaves as expected, with an $\mathcal O(n\log n)$ runtime when the input is random and a worst-case runtime of $\mathcal O(n^2)$ when the input is either sorted or reverse sorted.

\section{Conclusion}
We implemented and benchmarked two sorting algorithms, insertion sort and quicksort, in x86\_64 Linux Assembly, sorting coordinate pairs by a chosen dimension through an array of pointers. The measured runtimes match the theoretical ones: insertion sort is linear on sorted input and quadratic on random or reverse sorted input, while quicksort runs in $\mathcal O(n\log n)$ on random input and quadratically on sorted and reverse sorted input. On large random datasets quicksort is dramatically faster, roughly 2400 times on one million coordinates, confirming that the choice of algorithm should be guided by the expected structure of the data.
\end{document}