# HG changeset patch # User Bryan O'Sullivan # Date 1163200189 28800 # Node ID 75c076c7a374b62ef7502cec05290a9ffa7d79af # Parent 1b67dc96f27ab5d6816fdf4fa47af08452b87d8a More concepts stuff. diff -r 1b67dc96f27a -r 75c076c7a374 en/Makefile --- a/en/Makefile Fri Nov 10 12:42:00 2006 -0800 +++ b/en/Makefile Fri Nov 10 15:09:49 2006 -0800 @@ -25,6 +25,7 @@ kdiff3.png \ metadata.svg \ mq-stack.svg \ + snapshot.svg \ tour-history.svg \ tour-merge-conflict.svg \ tour-merge-merge.svg \ diff -r 1b67dc96f27a -r 75c076c7a374 en/concepts.tex --- a/en/concepts.tex Fri Nov 10 12:42:00 2006 -0800 +++ b/en/concepts.tex Fri Nov 10 15:09:49 2006 -0800 @@ -77,15 +77,15 @@ \label{fig:concepts:metadata} \end{figure} -Note that there is not a ``one to one'' relationship between revisions -in these different metadata files. If the manifest hasn't changed -between two changesets, their changelog entries will point to the same -revision of the manifest. If a file that Mercurial tracks hasn't -changed between two changesets, the entry for that file in the two -revisions of the manifest will point to the same revision of its -filelog. - -\section{An efficient, unified, safe storage mechanism} +As the illustration shows, there is \emph{not} a ``one to one'' +relationship between revisions in the changelog, manifest, or filelog. +If the manifest hasn't changed between two changesets, the changelog +entries for those changesets will point to the same revision of the +manifest. If a file that Mercurial tracks hasn't changed between two +changesets, the entry for that file in the two revisions of the +manifest will point to the same revision of its filelog. + +\section{Safe, efficient storage} The underpinnings of changelogs, manifests, and filelogs are provided by a single structure called the \emph{revlog}. @@ -136,13 +136,24 @@ revisions you must read, hence the longer it takes to reconstruct a particular revision. 
+\begin{figure}[ht] + \centering + \grafix{snapshot} + \caption{Snapshot of a revlog, with incremental deltas} + \label{fig:concepts:snapshot} +\end{figure} + The innovation that Mercurial applies to this problem is simple but effective. Once the cumulative amount of delta information stored since the last snapshot exceeds a fixed threshold, it stores a new snapshot (compressed, of course), instead of another delta. This makes it possible to reconstruct \emph{any} revision of a file -quickly. This approach works so well that it has subsequently been -copied by several other revision control systems. +quickly. This approach works so well that it has since been copied by +several other revision control systems. + +Figure~\ref{fig:concepts:snapshot} illustrates the idea. In an entry +in a revlog's index file, Mercurial stores the range of entries from +the data file that it must read to reconstruct a particular revision. \subsubsection{Aside: the influence of video compression} @@ -163,26 +174,6 @@ the next key frame is received. Also, the accumulation of encoding errors restarts anew with each key frame. -\subsection{Clever compression} - -When appropriate, Mercurial will store both snapshots and deltas in -compressed form. It does this by always \emph{trying to} compress a -snapshot or delta, but only storing the compressed version if it's -smaller than the uncompressed version. - -This means that Mercurial does ``the right thing'' when storing a file -whose native form is compressed, such as a \texttt{zip} archive or a -JPEG image. When these types of files are compressed a second time, -the resulting file is usually bigger than the once-compressed form, -and so Mercurial will store the plain \texttt{zip} or JPEG. - -Deltas between revisions of a compressed file are usually larger than -snapshots of the file, and Mercurial again does ``the right thing'' in -these cases. 
It finds that such a delta exceeds the threshold at -which it should store a complete snapshot of the file, so it stores -the snapshot, again saving space compared to a naive delta-only -approach. - \subsection{Strong integrity} Along with delta or snapshot information, a revlog entry contains a @@ -202,6 +193,83 @@ after the corrupted section. This would not be possible with a delta-only storage model. +\section{The working directory} + +Mercurial's good ideas are not confined to the repository; it also +needs to manage the working directory. The \emph{dirstate} contains +Mercurial's knowledge of the working directory. This details which +revision(s) the working directory is updated to, and all files that +Mercurial is tracking in the working directory. + +Because Mercurial doesn't force you to tell it when you're modifying a +file, it uses the dirstate to store some extra information so it can +determine efficiently whether you have modified a file. For each file +in the working directory, it stores the time that it last modified the +file itself, and the size of the file at that time. + +When Mercurial is checking the states of files in the working +directory, it first checks a file's modification time. If that has +not changed, the file must not have been modified. If the file's size +has changed, the file must have been modified. If the modification +time has changed, but the size has not, only then does Mercurial need +to read the actual contents of the file to see if they've changed. +Storing these few extra pieces of information dramatically reduces the +amount of data that Mercurial needs to read, which yields large +performance improvements compared to other revision control systems. + +\section{Other interesting design features} + +In the sections above, I've tried to highlight some of the most +important aspects of Mercurial's design, to illustrate that it pays +careful attention to reliability and performance. 
However, the +attention to detail doesn't stop there. There are a number of other +aspects of Mercurial's construction that I personally find +interesting. I'll detail a few of them here, separate from the ``big +ticket'' items above, so that if you're interested, you can gain a +better idea of the amount of thinking that goes into a well-designed +system. + +\subsection{Clever compression} + +When appropriate, Mercurial will store both snapshots and deltas in +compressed form. It does this by always \emph{trying to} compress a +snapshot or delta, but only storing the compressed version if it's +smaller than the uncompressed version. + +This means that Mercurial does ``the right thing'' when storing a file +whose native form is compressed, such as a \texttt{zip} archive or a +JPEG image. When these types of files are compressed a second time, +the resulting file is usually bigger than the once-compressed form, +and so Mercurial will store the plain \texttt{zip} or JPEG. + +Deltas between revisions of a compressed file are usually larger than +snapshots of the file, and Mercurial again does ``the right thing'' in +these cases. It finds that such a delta exceeds the threshold at +which it should store a complete snapshot of the file, so it stores +the snapshot, again saving space compared to a naive delta-only +approach. + +\subsubsection{Network recompression} + +When storing revisions on disk, Mercurial uses the ``deflate'' +compression algorithm (the same one used by the popular \texttt{zip} +archive format), which balances good speed with a respectable +compression ratio. However, when transmitting revision data over a +network connection, Mercurial uncompresses the compressed revision +data. + +If the connection is over HTTP, Mercurial recompresses the entire +stream of data using a compression algorithm that gives a better +compression ratio (the Burrows-Wheeler algorithm from the widely used +\texttt{bzip2} compression package). 
This combination of algorithm +and compression of the entire stream (instead of a revision at a time) +substantially reduces the number of bytes to be transferred, yielding +better network performance over almost all kinds of network. + +(If the connection is over \command{ssh}, Mercurial \emph{doesn't} +recompress the stream, because \command{ssh} can already do this +itself.) + \subsection{Read/write ordering and atomicity} Appending to files isn't the whole story when it comes to guaranteeing @@ -241,15 +309,25 @@ makes for all kinds of nasty and annoying security and administrative problems.) -Mercurial uses a locking mechanism to ensure that only one process can -write to a repository at a time. This locking mechanism is safe even -over filesystems that are notoriously unsafe for locking, such as NFS. -If a repository is locked, a writer will wait for a while to retry if -the repository becomes unlocked, but if the repository remains locked -for too long, the process attempting to write will time out after a -while. This means that your daily automated scripts won't get stuck -forever and pile up if a system crashes unnoticed, for example. (Yes, -the timeout is configurable, from zero to infinity.) +Mercurial uses locks to ensure that only one process can write to a +repository at a time (the locking mechanism is safe even over +filesystems that are notoriously hostile to locking, such as NFS). If +a repository is locked, a writer will wait for a while to retry if the +repository becomes unlocked, but if the repository remains locked for +too long, the process attempting to write will time out after a while. +This means that your daily automated scripts won't get stuck forever +and pile up if a system crashes unnoticed, for example. (Yes, the +timeout is configurable, from zero to infinity.) + +\subsubsection{Safe dirstate access} + +As with revision data, Mercurial doesn't take a lock to read the +dirstate file; it does acquire a lock to write it. 
To avoid the +possibility of reading a partially written copy of the dirstate file, +Mercurial writes to a file with a unique name in the same directory as +the dirstate file, then renames the temporary file atomically to +\filename{dirstate}. The file named \filename{dirstate} is thus +guaranteed to be complete, not partially written. diff -r 1b67dc96f27a -r 75c076c7a374 en/metadata.svg --- a/en/metadata.svg Fri Nov 10 12:42:00 2006 -0800 +++ b/en/metadata.svg Fri Nov 10 15:09:49 2006 -0800 @@ -67,7 +67,7 @@ id="layer1"> + y="446.68359" /> + y="446.68359" /> + y="446.68359" /> - + @@ -298,12 +298,12 @@ xml:space="preserve" style="font-size:12px;font-style:normal;font-weight:normal;fill:black;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1;font-family:Times New Roman" x="82.072548" - y="436.64789" + y="432.64789" id="text3094">Changelog + y="432.64789">Changelog