hgbook
diff fr/ch04-concepts.xml @ 979:64475a75365b
Merged from rpelisse
author | Jean-Marie Clément <JeanMarieClement@web.de> |
---|---|
date | Fri Sep 04 18:24:06 2009 +0200 (2009-09-04) |
parents | |
children | e6894aa7baf2 |
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/fr/ch04-concepts.xml Fri Sep 04 18:24:06 2009 +0200
@@ -0,0 +1,710 @@
+<!-- vim: set filetype=docbkxml shiftwidth=2 autoindent expandtab tw=77 : -->
+
+<chapter>
+<title>Behind the scenes</title>
+<para>\label{chap:concepts}</para>
+
+<para>Unlike many revision control systems, the concepts upon which
+Mercurial is built are simple enough that it's easy to understand how
+the software really works. Knowing this certainly isn't necessary,
+but I find it useful to have a <quote>mental model</quote> of what's going on.</para>
+
+<para>This understanding gives me confidence that Mercurial has been
+carefully designed to be both <emphasis>safe</emphasis> and <emphasis>efficient</emphasis>. And
+just as importantly, if it's easy for me to retain a good idea of what
+the software is doing when I perform a revision control task, I'm less
+likely to be surprised by its behaviour.</para>
+
+<para>In this chapter, we'll initially cover the core concepts behind
+Mercurial's design, then continue to discuss some of the interesting
+details of its implementation.</para>
+
+<sect1>
+<title>Mercurial's historical record</title>
+
+<sect2>
+<title>Tracking the history of a single file</title>
+
+<para>When Mercurial tracks modifications to a file, it stores the history
+of that file in a metadata object called a <emphasis>filelog</emphasis>. Each entry
+in the filelog contains enough information to reconstruct one revision
+of the file that is being tracked. Filelogs are stored as files in
+the <filename role="special" class="directory">.hg/store/data</filename> directory.
+A filelog contains two kinds of information: revision data, and an index to help Mercurial to find
+a revision efficiently.</para>
+
+<para>A file that is large, or has a lot of history, has its filelog stored
+in separate data (<quote><literal>.d</literal></quote> suffix) and index (<quote><literal>.i</literal></quote>
+suffix) files. For small files without much history, the revision
+data and index are combined in a single <quote><literal>.i</literal></quote> file. The
+correspondence between a file in the working directory and the filelog
+that tracks its history in the repository is illustrated in
+figure <xref linkend="fig:concepts:filelog"/>.</para>
+
+<informalfigure>
+
+<para> <mediaobject><imageobject><imagedata fileref="filelog"/></imageobject><textobject><phrase>XXX add text</phrase></textobject></mediaobject>
+ \caption{Relationships between files in working directory and
+ filelogs in repository}
+ \label{fig:concepts:filelog}</para>
+</informalfigure>
+
+</sect2>
+<sect2>
+<title>Managing tracked files</title>
+
+<para>Mercurial uses a structure called a <emphasis>manifest</emphasis> to collect
+together information about the files that it tracks. Each entry in
+the manifest contains information about the files present in a single
+changeset. An entry records which files are present in the changeset,
+the revision of each file, and a few other pieces of file metadata.</para>
+
+</sect2>
+<sect2>
+<title>Recording changeset information</title>
+
+<para>The <emphasis>changelog</emphasis> contains information about each changeset. Each
+revision records who committed a change, the changeset comment, other
+pieces of changeset-related information, and the revision of the
+manifest to use.
+</para>
+
+</sect2>
+<sect2>
+<title>Relationships between revisions</title>
+
+<para>Within a changelog, a manifest, or a filelog, each revision stores a
+pointer to its immediate parent (or to its two parents, if it's a
+merge revision). As I mentioned above, there are also relationships
+between revisions <emphasis>across</emphasis> these structures, and they are
+hierarchical in nature.
+</para>
+
+<para>For every changeset in a repository, there is exactly one revision
+stored in the changelog. Each revision of the changelog contains a
+pointer to a single revision of the manifest. A revision of the
+manifest stores a pointer to a single revision of each filelog tracked
+when that changeset was created. These relationships are illustrated
+in figure <xref linkend="fig:concepts:metadata"/>.
+</para>
+
+<informalfigure>
+
+<para> <mediaobject><imageobject><imagedata fileref="metadata"/></imageobject><textobject><phrase>XXX add text</phrase></textobject></mediaobject>
+ <caption><para>Metadata relationships</para></caption>
+ \label{fig:concepts:metadata}
+</para>
+</informalfigure>
+
+<para>As the illustration shows, there is <emphasis>not</emphasis> a <quote>one to one</quote>
+relationship between revisions in the changelog, manifest, or filelog.
+If the manifest hasn't changed between two changesets, the changelog
+entries for those changesets will point to the same revision of the
+manifest. If a file that Mercurial tracks hasn't changed between two
+changesets, the entry for that file in the two revisions of the
+manifest will point to the same revision of its filelog.
+</para>
+
+</sect2>
+</sect1>
+<sect1>
+<title>Safe, efficient storage</title>
+
+<para>The underpinnings of changelogs, manifests, and filelogs are provided
+by a single structure called the <emphasis>revlog</emphasis>.
+</para>
+
+<sect2>
+<title>Efficient storage</title>
+
+<para>The revlog provides efficient storage of revisions using a
+<emphasis>delta</emphasis> mechanism. Instead of storing a complete copy of a file
+for each revision, it stores the changes needed to transform an older
+revision into the new revision. For many kinds of file data, these
+deltas are typically a fraction of a percent of the size of a full
+copy of a file.
+</para>
+
+<para>Some obsolete revision control systems can only work with deltas of
+text files. They must either store binary files as complete snapshots
+or encoded into a text representation, both of which are wasteful
+approaches. Mercurial can efficiently handle deltas of files with
+arbitrary binary contents; it doesn't need to treat text as special.
+</para>
+
+</sect2>
+<sect2>
+<title>Safe operation</title>
+<para>\label{sec:concepts:txn}
+</para>
+
+<para>Mercurial only ever <emphasis>appends</emphasis> data to the end of a revlog file.
+It never modifies a section of a file after it has written it. This
+is both more robust and efficient than schemes that need to modify or
+rewrite data.
+</para>
+
+<para>In addition, Mercurial treats every write as part of a
+<emphasis>transaction</emphasis> that can span a number of files. A transaction is
+<emphasis>atomic</emphasis>: either the entire transaction succeeds and its effects
+are all visible to readers in one go, or the whole thing is undone.
+This guarantee of atomicity means that if you're running two copies of
+Mercurial, where one is reading data and one is writing it, the reader
+will never see a partially written result that might confuse it.
+</para>
+
+<para>The fact that Mercurial only appends to files makes it easier to
+provide this transactional guarantee. The easier it is to do stuff
+like this, the more confident you should be that it's done correctly.
+</para>
+
+</sect2>
+<sect2>
+<title>Fast retrieval</title>
+
+<para>Mercurial cleverly avoids a pitfall common to all earlier
+revision control systems: the problem of <emphasis>inefficient retrieval</emphasis>.
+Most revision control systems store the contents of a revision as an
+incremental series of modifications against a <quote>snapshot</quote>. To
+reconstruct a specific revision, you must first read the snapshot, and
+then every one of the revisions between the snapshot and your target
+revision. The more history that a file accumulates, the more
+revisions you must read, hence the longer it takes to reconstruct a
+particular revision.
+</para>
+
+<informalfigure>
+
+<para> <mediaobject><imageobject><imagedata fileref="snapshot"/></imageobject><textobject><phrase>XXX add text</phrase></textobject></mediaobject>
+ <caption><para>Snapshot of a revlog, with incremental deltas</para></caption>
+ \label{fig:concepts:snapshot}
+</para>
+</informalfigure>
+
+<para>The innovation that Mercurial applies to this problem is simple but
+effective. Once the cumulative amount of delta information stored
+since the last snapshot exceeds a fixed threshold, it stores a new
+snapshot (compressed, of course), instead of another delta.
+This makes it possible to reconstruct <emphasis>any</emphasis> revision of a file
+quickly. This approach works so well that it has since been copied by
+several other revision control systems.
+</para>
+
+<para>Figure <xref linkend="fig:concepts:snapshot"/> illustrates the idea. In an entry
+in a revlog's index file, Mercurial stores the range of entries from
+the data file that it must read to reconstruct a particular revision.
+</para>
+
+<sect3>
+<title>Aside: the influence of video compression</title>
+
+<para>If you're familiar with video compression or have ever watched a TV
+feed through a digital cable or satellite service, you may know that
+most video compression schemes store each frame of video as a delta
+against its predecessor frame. In addition, these schemes use
+<quote>lossy</quote> compression techniques to increase the compression ratio, so
+visual errors accumulate over the course of a number of inter-frame
+deltas.
+</para>
+
+<para>Because it's possible for a video stream to <quote>drop out</quote> occasionally
+due to signal glitches, and to limit the accumulation of artefacts
+introduced by the lossy compression process, video encoders
+periodically insert a complete frame (called a <quote>key frame</quote>) into the
+video stream; the next delta is generated against that frame. This
+means that if the video signal gets interrupted, it will resume once
+the next key frame is received. Also, the accumulation of encoding
+errors restarts anew with each key frame.
+</para>
+
+</sect3>
+</sect2>
+<sect2>
+<title>Identification and strong integrity</title>
+
+<para>Along with delta or snapshot information, a revlog entry contains a
+cryptographic hash of the data that it represents.
+This makes it difficult to forge the contents of a revision, and easy to detect
+accidental corruption.
+</para>
+
+<para>Hashes provide more than a mere check against corruption; they are
+used as the identifiers for revisions. The changeset identification
+hashes that you see as an end user are from revisions of the
+changelog. Although filelogs and the manifest also use hashes,
+Mercurial only uses these behind the scenes.
+</para>
+
+<para>Mercurial verifies that hashes are correct when it retrieves file
+revisions and when it pulls changes from another repository. If it
+encounters an integrity problem, it will complain and stop whatever
+it's doing.
+</para>
+
+<para>In addition to the effect it has on retrieval efficiency, Mercurial's
+use of periodic snapshots makes it more robust against partial data
+corruption. If a revlog becomes partly corrupted due to a hardware
+error or system bug, it's often possible to reconstruct some or most
+revisions from the uncorrupted sections of the revlog, both before and
+after the corrupted section. This would not be possible with a
+delta-only storage model.
+</para>
+
+<para>\section{Revision history, branching,
+ and merging}
+</para>
+
+<para>Every entry in a Mercurial revlog knows the identity of its immediate
+ancestor revision, usually referred to as its <emphasis>parent</emphasis>. In fact,
+a revision contains room for not one parent, but two. Mercurial uses
+a special hash, called the <quote>null ID</quote>, to represent the idea <quote>there
+is no parent here</quote>. This hash is simply a string of zeroes.
+</para>
+
+<para>In figure <xref linkend="fig:concepts:revlog"/>, you can see an example of the
+conceptual structure of a revlog.
+Filelogs, manifests, and changelogs all have this same structure; they differ only in the kind of data
+stored in each delta or snapshot.
+</para>
+
+<para>The first revision in a revlog (at the bottom of the image) has the
+null ID in both of its parent slots. For a <quote>normal</quote> revision, its
+first parent slot contains the ID of its parent revision, and its
+second contains the null ID, indicating that the revision has only one
+real parent. Any two revisions that have the same parent ID are
+branches. A revision that represents a merge between branches has two
+normal revision IDs in its parent slots.
+</para>
+
+<informalfigure>
+
+<para> <mediaobject><imageobject><imagedata fileref="revlog"/></imageobject><textobject><phrase>XXX add text</phrase></textobject></mediaobject>
+ \caption{}
+ \label{fig:concepts:revlog}
+</para>
+</informalfigure>
+
+</sect2>
+</sect1>
+<sect1>
+<title>The working directory</title>
+
+<para>In the working directory, Mercurial stores a snapshot of the files
+from the repository as of a particular changeset.
+</para>
+
+<para>The working directory <quote>knows</quote> which changeset it contains. When you
+update the working directory to contain a particular changeset,
+Mercurial looks up the appropriate revision of the manifest to find
+out which files it was tracking at the time that changeset was
+committed, and which revision of each file was then current. It then
+recreates a copy of each of those files, with the same contents it had
+when the changeset was committed.
+</para>
+
+<para>The <emphasis>dirstate</emphasis> contains Mercurial's knowledge of the working
+directory.
+This details which changeset the working directory is updated to, and all of the files that Mercurial is tracking in the
+working directory.
+</para>
+
+<para>Just as a revision of a revlog has room for two parents, so that it
+can represent either a normal revision (with one parent) or a merge of
+two earlier revisions, the dirstate has slots for two parents. When
+you use the <command role="hg-cmd">hg update</command> command, the changeset that you update to
+is stored in the <quote>first parent</quote> slot, and the null ID in the second.
+When you <command role="hg-cmd">hg merge</command> with another changeset, the first parent
+remains unchanged, and the second parent is filled in with the
+changeset you're merging with. The <command role="hg-cmd">hg parents</command> command tells you
+what the parents of the dirstate are.
+</para>
+
+<sect2>
+<title>What happens when you commit</title>
+
+<para>The dirstate stores parent information for more than just book-keeping
+purposes. Mercurial uses the parents of the dirstate as \emph{the
+ parents of a new changeset} when you perform a commit.
+</para>
+
+<informalfigure>
+
+<para> <mediaobject><imageobject><imagedata fileref="wdir"/></imageobject><textobject><phrase>XXX add text</phrase></textobject></mediaobject>
+ <caption><para>The working directory can have two parents</para></caption>
+ \label{fig:concepts:wdir}
+</para>
+</informalfigure>
+
+<para>Figure <xref linkend="fig:concepts:wdir"/> shows the normal state of the working
+directory, where it has a single changeset as parent. That changeset
+is the <emphasis>tip</emphasis>, the newest changeset in the repository that has no
+children.
+</para>
+
+<informalfigure>
+
+<para> <mediaobject><imageobject><imagedata fileref="wdir-after-commit"/></imageobject><textobject><phrase>XXX add text</phrase></textobject></mediaobject>
+ <caption><para>The working directory gains new parents after a commit</para></caption>
+ \label{fig:concepts:wdir-after-commit}
+</para>
+</informalfigure>
+
+<para>It's useful to think of the working directory as <quote>the changeset I'm
+about to commit</quote>. Any files that you tell Mercurial that you've
+added, removed, renamed, or copied will be reflected in that
+changeset, as will modifications to any files that Mercurial is
+already tracking; the new changeset will have the parents of the
+working directory as its parents.
+</para>
+
+<para>After a commit, Mercurial will update the parents of the working
+directory, so that the first parent is the ID of the new changeset,
+and the second is the null ID. This is shown in
+figure <xref linkend="fig:concepts:wdir-after-commit"/>. Mercurial doesn't touch
+any of the files in the working directory when you commit; it just
+modifies the dirstate to note its new parents.
+</para>
+
+</sect2>
+<sect2>
+<title>Creating a new head</title>
+
+<para>It's perfectly normal to update the working directory to a changeset
+other than the current tip. For example, you might want to know what
+your project looked like last Tuesday, or you could be looking through
+changesets to see which one introduced a bug. In cases like this, the
+natural thing to do is update the working directory to the changeset
+you're interested in, and then examine the files in the working
+directory directly to see their contents as they were when you
+committed that changeset.
+The effect of this is shown in figure <xref linkend="fig:concepts:wdir-pre-branch"/>.
+</para>
+
+<informalfigure>
+
+<para> <mediaobject><imageobject><imagedata fileref="wdir-pre-branch"/></imageobject><textobject><phrase>XXX add text</phrase></textobject></mediaobject>
+ <caption><para>The working directory, updated to an older changeset</para></caption>
+ \label{fig:concepts:wdir-pre-branch}
+</para>
+</informalfigure>
+
+<para>Having updated the working directory to an older changeset, what
+happens if you make some changes, and then commit? Mercurial behaves
+in the same way as I outlined above. The parents of the working
+directory become the parents of the new changeset. This new changeset
+has no children, so it becomes the new tip. And the repository now
+contains two changesets that have no children; we call these
+<emphasis>heads</emphasis>. You can see the structure that this creates in
+figure <xref linkend="fig:concepts:wdir-branch"/>.
+</para>
+
+<informalfigure>
+
+<para> <mediaobject><imageobject><imagedata fileref="wdir-branch"/></imageobject><textobject><phrase>XXX add text</phrase></textobject></mediaobject>
+ <caption><para>After a commit made while synced to an older changeset</para></caption>
+ \label{fig:concepts:wdir-branch}
+</para>
+</informalfigure>
+
+<note>
+<para> If you're new to Mercurial, you should keep in mind a common
+ <quote>error</quote>, which is to use the <command role="hg-cmd">hg pull</command> command without any
+ options. By default, the <command role="hg-cmd">hg pull</command> command <emphasis>does not</emphasis>
+ update the working directory, so you'll bring new changesets into
+ your repository, but the working directory will stay synced at the
+ same changeset as before the pull.
+ If you make some changes and commit afterwards, you'll thus create a new head, because your
+ working directory isn't synced to whatever the current tip is.
+</para>
+
+<para> I put the word <quote>error</quote> in quotes because all that you need to do
+ to rectify this situation is <command role="hg-cmd">hg merge</command>, then <command role="hg-cmd">hg commit</command>. In
+ other words, this almost never has negative consequences; it just
+ surprises people. I'll discuss other ways to avoid this behaviour,
+ and why Mercurial behaves in this initially surprising way, later
+ on.
+</para>
+</note>
+
+</sect2>
+<sect2>
+<title>Merging heads</title>
+
+<para>When you run the <command role="hg-cmd">hg merge</command> command, Mercurial leaves the first
+parent of the working directory unchanged, and sets the second parent
+to the changeset you're merging with, as shown in
+figure <xref linkend="fig:concepts:wdir-merge"/>.
+</para>
+
+<informalfigure>
+
+<para> <mediaobject><imageobject><imagedata fileref="wdir-merge"/></imageobject><textobject><phrase>XXX add text</phrase></textobject></mediaobject>
+ <caption><para>Merging two heads</para></caption>
+ \label{fig:concepts:wdir-merge}
+</para>
+</informalfigure>
+
+<para>Mercurial also has to modify the working directory, to merge the files
+managed in the two changesets. Simplified a little, the merging
+process goes like this, for every file in the manifests of both
+changesets.
+</para>
+<itemizedlist>
+<listitem><para>If neither changeset has modified a file, do nothing with that
+ file.
+</para>
+</listitem>
+<listitem><para>If one changeset has modified a file, and the other hasn't,
+ create the modified copy of the file in the working directory.
+</para>
+</listitem>
+<listitem><para>If one changeset has removed a file, and the other hasn't (or
+ has also deleted it), delete the file from the working directory.
+</para>
+</listitem>
+<listitem><para>If one changeset has removed a file, but the other has modified
+ the file, ask the user what to do: keep the modified file, or remove
+ it?
+</para>
+</listitem>
+<listitem><para>If both changesets have modified a file, invoke an external
+ merge program to choose the new contents for the merged file. This
+ may require input from the user.
+</para>
+</listitem>
+<listitem><para>If one changeset has modified a file, and the other has renamed
+ or copied the file, make sure that the changes follow the new name
+ of the file.
+</para>
+</listitem></itemizedlist>
+<para>There are more details&emdash;merging has plenty of corner cases&emdash;but
+these are the most common choices that are involved in a merge. As
+you can see, most cases are completely automatic, and indeed most
+merges finish automatically, without requiring your input to resolve
+any conflicts.
+</para>
+
+<para>When you're thinking about what happens when you commit after a merge,
+once again the working directory is <quote>the changeset I'm about to
+commit</quote>. After the <command role="hg-cmd">hg merge</command> command completes, the working
+directory has two parents; these will become the parents of the new
+changeset.
+</para>
+
+<para>Mercurial lets you perform multiple merges, but you must commit the
+results of each individual merge as you go. This is necessary because
+Mercurial only tracks two parents for both revisions and the working
+directory.
+While it would be technically possible to merge multiple changesets at once, the prospect of user confusion and making a
+terrible mess of a merge immediately becomes overwhelming.
+</para>
+
+</sect2>
+</sect1>
+<sect1>
+<title>Other interesting design features</title>
+
+<para>In the sections above, I've tried to highlight some of the most
+important aspects of Mercurial's design, to illustrate that it pays
+careful attention to reliability and performance. However, the
+attention to detail doesn't stop there. There are a number of other
+aspects of Mercurial's construction that I personally find
+interesting. I'll detail a few of them here, separate from the <quote>big
+ticket</quote> items above, so that if you're interested, you can gain a
+better idea of the amount of thinking that goes into a well-designed
+system.
+</para>
+
+<sect2>
+<title>Clever compression</title>
+
+<para>When appropriate, Mercurial will store both snapshots and deltas in
+compressed form. It does this by always <emphasis>trying to</emphasis> compress a
+snapshot or delta, but only storing the compressed version if it's
+smaller than the uncompressed version.
+</para>
+
+<para>This means that Mercurial does <quote>the right thing</quote> when storing a file
+whose native form is compressed, such as a <literal>zip</literal> archive or a
+JPEG image. When these types of files are compressed a second time,
+the resulting file is usually bigger than the once-compressed form,
+and so Mercurial will store the plain <literal>zip</literal> or JPEG.
+</para>
+
+<para>Deltas between revisions of a compressed file are usually larger than
+snapshots of the file, and Mercurial again does <quote>the right thing</quote> in
+these cases.
+It finds that such a delta exceeds the threshold at which it should store a complete snapshot of the file, so it stores
+the snapshot, again saving space compared to a naive delta-only
+approach.
+</para>
+
+<sect3>
+<title>Network recompression</title>
+
+<para>When storing revisions on disk, Mercurial uses the <quote>deflate</quote>
+compression algorithm (the same one used by the popular <literal>zip</literal>
+archive format), which balances good speed with a respectable
+compression ratio. However, when transmitting revision data over a
+network connection, Mercurial uncompresses the compressed revision
+data.
+</para>
+
+<para>If the connection is over HTTP, Mercurial recompresses the entire
+stream of data using a compression algorithm that gives a better
+compression ratio (the Burrows-Wheeler algorithm from the widely used
+<literal>bzip2</literal> compression package). This combination of algorithm
+and compression of the entire stream (instead of a revision at a time)
+substantially reduces the number of bytes to be transferred, yielding
+better network performance over almost all kinds of network.
+</para>
+
+<para>(If the connection is over <command>ssh</command>, Mercurial <emphasis>doesn't</emphasis>
+recompress the stream, because <command>ssh</command> can already do this
+itself.)
+</para>
+
+</sect3>
+</sect2>
+<sect2>
+<title>Read/write ordering and atomicity</title>
+
+<para>Appending to files isn't the whole story when it comes to guaranteeing
+that a reader won't see a partial write. If you recall
+figure <xref linkend="fig:concepts:metadata"/>, revisions in the changelog point to
+revisions in the manifest, and revisions in the manifest point to
+revisions in filelogs. This hierarchy is deliberate.
+</para>
+
+<para>A writer starts a transaction by writing filelog and manifest data,
+and doesn't write any changelog data until those are finished. A
+reader starts by reading changelog data, then manifest data, followed
+by filelog data.
+</para>
+
+<para>Since the writer has always finished writing filelog and manifest data
+before it writes to the changelog, a reader will never read a pointer
+to a partially written manifest revision from the changelog, and it will
+never read a pointer to a partially written filelog revision from the
+manifest.
+</para>
+
+</sect2>
+<sect2>
+<title>Concurrent access</title>
+
+<para>The read/write ordering and atomicity guarantees mean that Mercurial
+never needs to <emphasis>lock</emphasis> a repository when it's reading data, even
+if the repository is being written to while the read is occurring.
+This has a big effect on scalability; you can have an arbitrary number
+of Mercurial processes reading data from a repository safely
+all at once, no matter whether it's being written to or not.
+</para>
+
+<para>The lockless nature of reading means that if you're sharing a
+repository on a multi-user system, you don't need to grant other local
+users permission to <emphasis>write</emphasis> to your repository in order for them
+to be able to clone it or pull changes from it; they only need
+<emphasis>read</emphasis> permission. (This is <emphasis>not</emphasis> a common feature among
+revision control systems, so don't take it for granted! Most require
+readers to be able to lock a repository to access it safely, and this
+requires write permission on at least one directory, which of course
+makes for all kinds of nasty and annoying security and administrative
+problems.)
+</para>
+
+<para>Mercurial uses locks to ensure that only one process can write to a
+repository at a time (the locking mechanism is safe even over
+filesystems that are notoriously hostile to locking, such as NFS). If
+a repository is locked, a writer will wait for a while to retry if the
+repository becomes unlocked, but if the repository remains locked for
+too long, the process attempting to write will time out after a while.
+This means that your daily automated scripts won't get stuck forever
+and pile up if a system crashes unnoticed, for example. (Yes, the
+timeout is configurable, from zero to infinity.)
+</para>
+
+<sect3>
+<title>Safe dirstate access</title>
+
+<para>As with revision data, Mercurial doesn't take a lock to read the
+dirstate file; it does acquire a lock to write it. To avoid the
+possibility of reading a partially written copy of the dirstate file,
+Mercurial writes to a file with a unique name in the same directory as
+the dirstate file, then renames the temporary file atomically to
+<filename>dirstate</filename>. The file named <filename>dirstate</filename> is thus
+guaranteed to be complete, not partially written.
+</para>
+
+</sect3>
+</sect2>
+<sect2>
+<title>Avoiding seeks</title>
+
+<para>Critical to Mercurial's performance is the avoidance of seeks of the
+disk head, since any seek is far more expensive than even a
+comparatively large read operation.
+</para>
+
+<para>This is why, for example, the dirstate is stored in a single file. If
+there were a dirstate file per directory that Mercurial tracked, the
+disk would seek once per directory. Instead, Mercurial reads the
+entire single dirstate file in one step.
+</para>
+
+<para>Mercurial also uses a <quote>copy on write</quote> scheme when cloning a
+repository on local storage. Instead of copying every revlog file
+from the old repository into the new repository, it makes a <quote>hard
+link</quote>, which is a shorthand way to say <quote>these two names point to the
+same file</quote>. When Mercurial is about to write to one of a revlog's
+files, it checks to see if the number of names pointing at the file is
+greater than one. If it is, more than one repository is using the
+file, so Mercurial makes a new copy of the file that is private to
+this repository.
+</para>
+
+<para>A few revision control developers have pointed out that this idea of
+making a complete private copy of a file is not very efficient in its
+use of storage. While this is true, storage is cheap, and this method
+gives the highest performance while deferring most book-keeping to the
+operating system. An alternative scheme would most likely reduce
+performance and increase the complexity of the software, each of which
+is much more important to the <quote>feel</quote> of day-to-day use.
+</para>
+
+</sect2>
+<sect2>
+<title>Other contents of the dirstate</title>
+
+<para>Because Mercurial doesn't force you to tell it when you're modifying a
+file, it uses the dirstate to store some extra information so it can
+determine efficiently whether you have modified a file. For each file
+in the working directory, it stores the time that it last modified the
+file itself, and the size of the file at that time.
+</para>
+
+<para>When you explicitly <command role="hg-cmd">hg add</command>, <command role="hg-cmd">hg remove</command>, <command role="hg-cmd">hg rename</command> or
+<command role="hg-cmd">hg copy</command> files, Mercurial updates the dirstate so that it knows
+what to do with those files when you commit.
+</para>
+
+<para>When Mercurial is checking the states of files in the working
+directory, it first checks a file's modification time. If that has
+not changed, the file must not have been modified. If the file's size
+has changed, the file must have been modified. If the modification
+time has changed, but the size has not, only then does Mercurial need
+to read the actual contents of the file to see if they've changed.
+Storing these few extra pieces of information dramatically reduces the
+amount of data that Mercurial needs to read, which yields large
+performance improvements compared to other revision control systems.
+</para>
+
+</sect2>
+</sect1>
+</chapter>
+
+<!--
+local variables:
+sgml-parent-document: ("00book.xml" "book" "chapter")
+end:
+-->
\ No newline at end of file
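
Note on the "Fast retrieval" section of the chapter added above: the rule it describes (store a delta until the delta data accumulated since the last snapshot would pass a fixed threshold, then store a compressed snapshot instead) may be easier to follow as a toy model. The Python below is only a sketch under loud assumptions: `ToyRevlog`, `make_delta`, `apply_delta` and the `threshold` value are names invented here, the prefix/suffix delta stands in for whatever delta encoding Mercurial really uses, and the in-memory list stands in for the on-disk index and data files. It is not Mercurial's code or format.

```python
import zlib
from dataclasses import dataclass, field


def make_delta(old: bytes, new: bytes):
    """Toy delta (an assumption of this sketch): keep the common prefix
    and suffix of old/new and store only the differing middle of new."""
    limit = min(len(old), len(new))
    prefix = 0
    while prefix < limit and old[prefix] == new[prefix]:
        prefix += 1
    suffix = 0
    while suffix < limit - prefix and old[-1 - suffix] == new[-1 - suffix]:
        suffix += 1
    return prefix, suffix, new[prefix:len(new) - suffix]


def apply_delta(old: bytes, delta) -> bytes:
    """Rebuild the newer text from the older text plus a toy delta."""
    prefix, suffix, middle = delta
    return old[:prefix] + middle + old[len(old) - suffix:]


@dataclass
class ToyRevlog:
    threshold: int = 64      # hypothetical limit on delta bytes per chain
    entries: list = field(default_factory=list)  # append-only, never rewritten
    _pending: int = 0        # delta bytes stored since the last snapshot

    def append(self, text: bytes) -> int:
        """Store a new revision and return its revision number."""
        if self.entries:
            delta = make_delta(self.read(len(self.entries) - 1), text)
            cost = len(delta[2])
            if self._pending + cost <= self.threshold:
                self.entries.append(("delta", delta))
                self._pending += cost
                return len(self.entries) - 1
        # First revision, or the chain would grow past the threshold:
        # store a compressed full snapshot and start a new chain.
        self.entries.append(("snapshot", zlib.compress(text)))
        self._pending = 0
        return len(self.entries) - 1

    def read(self, rev: int) -> bytes:
        """Walk back to the nearest snapshot, then replay deltas forward."""
        base = rev
        while self.entries[base][0] != "snapshot":
            base -= 1
        text = zlib.decompress(self.entries[base][1])
        for _kind, delta in self.entries[base + 1: rev + 1]:
            text = apply_delta(text, delta)
        return text
```

Because `read` never has to replay more than one threshold's worth of delta data, the cost of reconstructing a revision stays bounded however long the history grows, which is the point the chapter makes. Each snapshot can also be decoded without any earlier entry, which is what makes the partial-corruption recovery mentioned under "Identification and strong integrity" plausible.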