hgbook
diff en/ch04-concepts.xml @ 565:8a9c66da6fcb
Fix thinko
author | Bryan O'Sullivan <bos@serpentine.com> |
---|---|
date | Mon Mar 09 21:40:12 2009 -0700 (2009-03-09) |
parents | f72b7e6cbe90 |
children | 13513d2a128d |
line diff
1.1 --- /dev/null Thu Jan 01 00:00:00 1970 +0000 1.2 +++ b/en/ch04-concepts.xml Mon Mar 09 21:40:12 2009 -0700 1.3 @@ -0,0 +1,725 @@ 1.4 +<!-- vim: set filetype=docbkxml shiftwidth=2 autoindent expandtab tw=77 : --> 1.5 + 1.6 +<chapter id="chap:concepts"> 1.7 + <title>Behind the scenes</title> 1.8 + 1.9 + <para>Unlike many revision control systems, the concepts upon which 1.10 + Mercurial is built are simple enough that it's easy to understand 1.11 + how the software really works. Knowing this certainly isn't 1.12 + necessary, but I find it useful to have a <quote>mental 1.13 + model</quote> of what's going on.</para> 1.14 + 1.15 + <para>This understanding gives me confidence that Mercurial has been 1.16 + carefully designed to be both <emphasis>safe</emphasis> and 1.17 + <emphasis>efficient</emphasis>. And just as importantly, if it's 1.18 + easy for me to retain a good idea of what the software is doing 1.19 + when I perform a revision control task, I'm less likely to be 1.20 + surprised by its behaviour.</para> 1.21 + 1.22 + <para>In this chapter, we'll initially cover the core concepts 1.23 + behind Mercurial's design, then continue to discuss some of the 1.24 + interesting details of its implementation.</para> 1.25 + 1.26 + <sect1> 1.27 + <title>Mercurial's historical record</title> 1.28 + 1.29 + <sect2> 1.30 + <title>Tracking the history of a single file</title> 1.31 + 1.32 + <para>When Mercurial tracks modifications to a file, it stores 1.33 + the history of that file in a metadata object called a 1.34 + <emphasis>filelog</emphasis>. Each entry in the filelog 1.35 + contains enough information to reconstruct one revision of the 1.36 + file that is being tracked. Filelogs are stored as files in 1.37 + the <filename role="special" 1.38 + class="directory">.hg/store/data</filename> directory. 
A 1.39 + filelog contains two kinds of information: revision data, and 1.40 + an index to help Mercurial to find a revision 1.41 + efficiently.</para> 1.42 + 1.43 + <para>A file that is large, or has a lot of history, has its 1.44 + filelog stored in separate data 1.45 + (<quote><literal>.d</literal></quote> suffix) and index 1.46 + (<quote><literal>.i</literal></quote> suffix) files. For 1.47 + small files without much history, the revision data and index 1.48 + are combined in a single <quote><literal>.i</literal></quote> 1.49 + file. The correspondence between a file in the working 1.50 + directory and the filelog that tracks its history in the 1.51 + repository is illustrated in figure <xref 1.52 + linkend="fig:concepts:filelog"/>.</para> 1.53 + 1.54 + <informalfigure id="fig:concepts:filelog"> 1.55 + <mediaobject><imageobject><imagedata 1.56 + fileref="filelog"/></imageobject><textobject><phrase>XXX 1.57 + add text</phrase></textobject> 1.58 + <caption><para>Relationships between files in working 1.59 + directory and filelogs in 1.60 + repository</para></caption></mediaobject> 1.61 + </informalfigure> 1.62 + 1.63 + </sect2> 1.64 + <sect2> 1.65 + <title>Managing tracked files</title> 1.66 + 1.67 + <para>Mercurial uses a structure called a 1.68 + <emphasis>manifest</emphasis> to collect together information 1.69 + about the files that it tracks. Each entry in the manifest 1.70 + contains information about the files present in a single 1.71 + changeset. An entry records which files are present in the 1.72 + changeset, the revision of each file, and a few other pieces 1.73 + of file metadata.</para> 1.74 + 1.75 + </sect2> 1.76 + <sect2> 1.77 + <title>Recording changeset information</title> 1.78 + 1.79 + <para>The <emphasis>changelog</emphasis> contains information 1.80 + about each changeset. 
Each revision records who committed a 1.81 + change, the changeset comment, other pieces of 1.82 + changeset-related information, and the revision of the 1.83 + manifest to use.</para> 1.84 + 1.85 + </sect2> 1.86 + <sect2> 1.87 + <title>Relationships between revisions</title> 1.88 + 1.89 + <para>Within a changelog, a manifest, or a filelog, each 1.90 + revision stores a pointer to its immediate parent (or to its 1.91 + two parents, if it's a merge revision). As I mentioned above, 1.92 + there are also relationships between revisions 1.93 + <emphasis>across</emphasis> these structures, and they are 1.94 + hierarchical in nature.</para> 1.95 + 1.96 + <para>For every changeset in a repository, there is exactly one 1.97 + revision stored in the changelog. Each revision of the 1.98 + changelog contains a pointer to a single revision of the 1.99 + manifest. A revision of the manifest stores a pointer to a 1.100 + single revision of each filelog tracked when that changeset 1.101 + was created. These relationships are illustrated in figure 1.102 + <xref linkend="fig:concepts:metadata"/>.</para> 1.103 + 1.104 + <informalfigure id="fig:concepts:metadata"> 1.105 + <mediaobject><imageobject><imagedata 1.106 + fileref="metadata"/></imageobject><textobject><phrase>XXX 1.107 + add text</phrase></textobject><caption><para>Metadata 1.108 + relationships</para></caption> 1.109 + </mediaobject> 1.110 + </informalfigure> 1.111 + 1.112 + <para>As the illustration shows, there is 1.113 + <emphasis>not</emphasis> a <quote>one to one</quote> 1.114 + relationship between revisions in the changelog, manifest, or 1.115 + filelog. If the manifest hasn't changed between two 1.116 + changesets, the changelog entries for those changesets will 1.117 + point to the same revision of the manifest. 
If a file that 1.118 + Mercurial tracks hasn't changed between two changesets, the 1.119 + entry for that file in the two revisions of the manifest will 1.120 + point to the same revision of its filelog.</para> 1.121 + 1.122 + </sect2> 1.123 + </sect1> 1.124 + <sect1> 1.125 + <title>Safe, efficient storage</title> 1.126 + 1.127 + <para>The underpinnings of changelogs, manifests, and filelogs are 1.128 + provided by a single structure called the 1.129 + <emphasis>revlog</emphasis>.</para> 1.130 + 1.131 + <sect2> 1.132 + <title>Efficient storage</title> 1.133 + 1.134 + <para>The revlog provides efficient storage of revisions using a 1.135 + <emphasis>delta</emphasis> mechanism. Instead of storing a 1.136 + complete copy of a file for each revision, it stores the 1.137 + changes needed to transform an older revision into the new 1.138 + revision. For many kinds of file data, these deltas are 1.139 + typically a fraction of a percent of the size of a full copy 1.140 + of a file.</para> 1.141 + 1.142 + <para>Some obsolete revision control systems can only work with 1.143 + deltas of text files. They must either store binary files as 1.144 + complete snapshots or encoded into a text representation, both 1.145 + of which are wasteful approaches. Mercurial can efficiently 1.146 + handle deltas of files with arbitrary binary contents; it 1.147 + doesn't need to treat text as special.</para> 1.148 + 1.149 + </sect2> 1.150 + <sect2 id="sec:concepts:txn"> 1.151 + <title>Safe operation</title> 1.152 + 1.153 + <para>Mercurial only ever <emphasis>appends</emphasis> data to 1.154 + the end of a revlog file. It never modifies a section of a 1.155 + file after it has written it. This is both more robust and 1.156 + efficient than schemes that need to modify or rewrite 1.157 + data.</para> 1.158 + 1.159 + <para>In addition, Mercurial treats every write as part of a 1.160 + <emphasis>transaction</emphasis> that can span a number of 1.161 + files. 
A transaction is <emphasis>atomic</emphasis>: either 1.162 + the entire transaction succeeds and its effects are all 1.163 + visible to readers in one go, or the whole thing is undone. 1.164 + This guarantee of atomicity means that if you're running two 1.165 + copies of Mercurial, where one is reading data and one is 1.166 + writing it, the reader will never see a partially written 1.167 + result that might confuse it.</para> 1.168 + 1.169 + <para>The fact that Mercurial only appends to files makes it 1.170 + easier to provide this transactional guarantee. The easier it 1.171 + is to do stuff like this, the more confident you should be 1.172 + that it's done correctly.</para> 1.173 + 1.174 + </sect2> 1.175 + <sect2> 1.176 + <title>Fast retrieval</title> 1.177 + 1.178 + <para>Mercurial cleverly avoids a pitfall common to all earlier 1.179 + revision control systems: the problem of <emphasis>inefficient 1.180 + retrieval</emphasis>. Most revision control systems store 1.181 + the contents of a revision as an incremental series of 1.182 + modifications against a <quote>snapshot</quote>. To 1.183 + reconstruct a specific revision, you must first read the 1.184 + snapshot, and then every one of the revisions between the 1.185 + snapshot and your target revision. The more history that a 1.186 + file accumulates, the more revisions you must read, hence the 1.187 + longer it takes to reconstruct a particular revision.</para> 1.188 + 1.189 + <informalfigure id="fig:concepts:snapshot"> 1.190 + <mediaobject><imageobject><imagedata 1.191 + fileref="snapshot"/></imageobject><textobject><phrase>XXX 1.192 + add text</phrase></textobject><caption><para>Snapshot of 1.193 + a revlog, with incremental 1.194 + deltas</para></caption></mediaobject> 1.195 + </informalfigure> 1.196 + 1.197 + <para>The innovation that Mercurial applies to this problem is 1.198 + simple but effective. 
Once the cumulative amount of delta 1.199 + information stored since the last snapshot exceeds a fixed 1.200 + threshold, it stores a new snapshot (compressed, of course), 1.201 + instead of another delta. This makes it possible to 1.202 + reconstruct <emphasis>any</emphasis> revision of a file 1.203 + quickly. This approach works so well that it has since been 1.204 + copied by several other revision control systems.</para> 1.205 + 1.206 + <para>Figure <xref linkend="fig:concepts:snapshot"/> illustrates 1.207 + the idea. In an entry in a revlog's index file, Mercurial 1.208 + stores the range of entries from the data file that it must 1.209 + read to reconstruct a particular revision.</para> 1.210 + 1.211 + <sect3> 1.212 + <title>Aside: the influence of video compression</title> 1.213 + 1.214 + <para>If you're familiar with video compression or have ever 1.215 + watched a TV feed through a digital cable or satellite 1.216 + service, you may know that most video compression schemes 1.217 + store each frame of video as a delta against its predecessor 1.218 + frame. In addition, these schemes use <quote>lossy</quote> 1.219 + compression techniques to increase the compression ratio, so 1.220 + visual errors accumulate over the course of a number of 1.221 + inter-frame deltas.</para> 1.222 + 1.223 + <para>Because it's possible for a video stream to <quote>drop 1.224 + out</quote> occasionally due to signal glitches, and to 1.225 + limit the accumulation of artefacts introduced by the lossy 1.226 + compression process, video encoders periodically insert a 1.227 + complete frame (called a <quote>key frame</quote>) into the 1.228 + video stream; the next delta is generated against that 1.229 + frame. This means that if the video signal gets 1.230 + interrupted, it will resume once the next key frame is 1.231 + received. 
Also, the accumulation of encoding errors 1.232 + restarts anew with each key frame.</para> 1.233 + 1.234 + </sect3> 1.235 + </sect2> 1.236 + <sect2> 1.237 + <title>Identification and strong integrity</title> 1.238 + 1.239 + <para>Along with delta or snapshot information, a revlog entry 1.240 + contains a cryptographic hash of the data that it represents. 1.241 + This makes it difficult to forge the contents of a revision, 1.242 + and easy to detect accidental corruption.</para> 1.243 + 1.244 + <para>Hashes provide more than a mere check against corruption; 1.245 + they are used as the identifiers for revisions. The changeset 1.246 + identification hashes that you see as an end user are from 1.247 + revisions of the changelog. Although filelogs and the 1.248 + manifest also use hashes, Mercurial only uses these behind the 1.249 + scenes.</para> 1.250 + 1.251 + <para>Mercurial verifies that hashes are correct when it 1.252 + retrieves file revisions and when it pulls changes from 1.253 + another repository. If it encounters an integrity problem, it 1.254 + will complain and stop whatever it's doing.</para> 1.255 + 1.256 + <para>In addition to the effect it has on retrieval efficiency, 1.257 + Mercurial's use of periodic snapshots makes it more robust 1.258 + against partial data corruption. If a revlog becomes partly 1.259 + corrupted due to a hardware error or system bug, it's often 1.260 + possible to reconstruct some or most revisions from the 1.261 + uncorrupted sections of the revlog, both before and after the 1.262 + corrupted section. This would not be possible with a 1.263 + delta-only storage model.</para> 1.264 + 1.265 + </sect2> 1.266 + </sect1> 1.267 + <sect1> 1.268 + <title>Revision history, branching, and merging</title> 1.269 + 1.270 + <para>Every entry in a Mercurial revlog knows the identity of its 1.271 + immediate ancestor revision, usually referred to as its 1.272 + <emphasis>parent</emphasis>. 
In fact, a revision contains room 1.273 + for not one parent, but two. Mercurial uses a special hash, 1.274 + called the <quote>null ID</quote>, to represent the idea 1.275 + <quote>there is no parent here</quote>. This hash is simply a 1.276 + string of zeroes.</para> 1.277 + 1.278 + <para>In figure <xref linkend="fig:concepts:revlog"/>, you can see 1.279 + an example of the conceptual structure of a revlog. Filelogs, 1.280 + manifests, and changelogs all have this same structure; they 1.281 + differ only in the kind of data stored in each delta or 1.282 + snapshot.</para> 1.283 + 1.284 + <para>The first revision in a revlog (at the bottom of the image) 1.285 + has the null ID in both of its parent slots. For a 1.286 + <quote>normal</quote> revision, its first parent slot contains 1.287 + the ID of its parent revision, and its second contains the null 1.288 + ID, indicating that the revision has only one real parent. Any 1.289 + two revisions that have the same parent ID are branches. A 1.290 + revision that represents a merge between branches has two normal 1.291 + revision IDs in its parent slots.</para> 1.292 + 1.293 + <informalfigure id="fig:concepts:revlog"> 1.294 + <mediaobject><imageobject><imagedata 1.295 + fileref="revlog"/></imageobject><textobject><phrase>XXX 1.296 + add text</phrase></textobject></mediaobject> 1.297 + </informalfigure> 1.298 + 1.299 + </sect1> 1.300 + <sect1> 1.301 + <title>The working directory</title> 1.302 + 1.303 + <para>In the working directory, Mercurial stores a snapshot of the 1.304 + files from the repository as of a particular changeset.</para> 1.305 + 1.306 + <para>The working directory <quote>knows</quote> which changeset 1.307 + it contains. 
When you update the working directory to contain a 1.308 + particular changeset, Mercurial looks up the appropriate 1.309 + revision of the manifest to find out which files it was tracking 1.310 + at the time that changeset was committed, and which revision of 1.311 + each file was then current. It then recreates a copy of each of 1.312 + those files, with the same contents it had when the changeset 1.313 + was committed.</para> 1.314 + 1.315 + <para>The <emphasis>dirstate</emphasis> contains Mercurial's 1.316 + knowledge of the working directory. This details which 1.317 + changeset the working directory is updated to, and all of the 1.318 + files that Mercurial is tracking in the working 1.319 + directory.</para> 1.320 + 1.321 + <para>Just as a revision of a revlog has room for two parents, so 1.322 + that it can represent either a normal revision (with one parent) 1.323 + or a merge of two earlier revisions, the dirstate has slots for 1.324 + two parents. When you use the <command role="hg-cmd">hg 1.325 + update</command> command, the changeset that you update to is 1.326 + stored in the <quote>first parent</quote> slot, and the null ID 1.327 + in the second. When you <command role="hg-cmd">hg 1.328 + merge</command> with another changeset, the first parent 1.329 + remains unchanged, and the second parent is filled in with the 1.330 + changeset you're merging with. The <command role="hg-cmd">hg 1.331 + parents</command> command tells you what the parents of the 1.332 + dirstate are.</para> 1.333 + 1.334 + <sect2> 1.335 + <title>What happens when you commit</title> 1.336 + 1.337 + <para>The dirstate stores parent information for more than just 1.338 + book-keeping purposes. 
Mercurial uses the parents of the 1.339 + dirstate as <emphasis>the parents of a new 1.340 + changeset</emphasis> when you perform a commit.</para> 1.341 + 1.342 + <informalfigure id="fig:concepts:wdir"> 1.343 + <mediaobject><imageobject><imagedata 1.344 + fileref="wdir"/></imageobject><textobject><phrase>XXX 1.345 + add text</phrase></textobject><caption><para>The working 1.346 + directory can have two 1.347 + parents</para></caption></mediaobject> 1.348 + </informalfigure> 1.349 + 1.350 + <para>Figure <xref linkend="fig:concepts:wdir"/> shows the 1.351 + normal state of the working directory, where it has a single 1.352 + changeset as parent. That changeset is the 1.353 + <emphasis>tip</emphasis>, the newest changeset in the 1.354 + repository that has no children.</para> 1.355 + 1.356 + <informalfigure id="fig:concepts:wdir-after-commit"> 1.357 + <mediaobject><imageobject><imagedata 1.358 + fileref="wdir-after-commit"/></imageobject><textobject><phrase>XXX 1.359 + add text</phrase></textobject><caption><para>The working 1.360 + directory gains new parents after a 1.361 + commit</para></caption></mediaobject> 1.362 + </informalfigure> 1.363 + 1.364 + <para>It's useful to think of the working directory as 1.365 + <quote>the changeset I'm about to commit</quote>. Any files 1.366 + that you tell Mercurial that you've added, removed, renamed, 1.367 + or copied will be reflected in that changeset, as will 1.368 + modifications to any files that Mercurial is already tracking; 1.369 + the new changeset will have the parents of the working 1.370 + directory as its parents.</para> 1.371 + 1.372 + <para>After a commit, Mercurial will update the parents of the 1.373 + working directory, so that the first parent is the ID of the 1.374 + new changeset, and the second is the null ID. This is shown 1.375 + in figure <xref linkend="fig:concepts:wdir-after-commit"/>. 
1.376 + Mercurial 1.377 + doesn't touch any of the files in the working directory when 1.378 + you commit; it just modifies the dirstate to note its new 1.379 + parents.</para> 1.380 + 1.381 + </sect2> 1.382 + <sect2> 1.383 + <title>Creating a new head</title> 1.384 + 1.385 + <para>It's perfectly normal to update the working directory to a 1.386 + changeset other than the current tip. For example, you might 1.387 + want to know what your project looked like last Tuesday, or 1.388 + you could be looking through changesets to see which one 1.389 + introduced a bug. In cases like this, the natural thing to do 1.390 + is update the working directory to the changeset you're 1.391 + interested in, and then examine the files in the working 1.392 + directory directly to see their contents as they were when you 1.393 + committed that changeset. The effect of this is shown in 1.394 + figure <xref linkend="fig:concepts:wdir-pre-branch"/>.</para> 1.395 + 1.396 + <informalfigure id="fig:concepts:wdir-pre-branch"> 1.397 + <mediaobject><imageobject><imagedata 1.398 + fileref="wdir-pre-branch"/></imageobject><textobject><phrase>XXX 1.399 + add text</phrase></textobject><caption><para>The working 1.400 + directory, updated to an older 1.401 + changeset</para></caption></mediaobject> 1.402 + </informalfigure> 1.403 + 1.404 + <para>Having updated the working directory to an older 1.405 + changeset, what happens if you make some changes, and then 1.406 + commit? Mercurial behaves in the same way as I outlined 1.407 + above. The parents of the working directory become the 1.408 + parents of the new changeset. This new changeset has no 1.409 + children, so it becomes the new tip. And the repository now 1.410 + contains two changesets that have no children; we call these 1.411 + <emphasis>heads</emphasis>. 
You can see the structure that 1.412 + this creates in figure <xref 1.413 + linkend="fig:concepts:wdir-branch"/>.</para> 1.414 + 1.415 + <informalfigure id="fig:concepts:wdir-branch"> 1.416 + <mediaobject><imageobject><imagedata 1.417 + fileref="wdir-branch"/></imageobject><textobject><phrase>XXX 1.418 + add text</phrase></textobject><caption><para>After a 1.419 + commit made while synced to an older 1.420 + changeset</para></caption></mediaobject> 1.421 + </informalfigure> 1.422 + 1.423 + <note> 1.424 + <para> If you're new to Mercurial, you should keep in mind a 1.425 + common <quote>error</quote>, which is to use the <command 1.426 + role="hg-cmd">hg pull</command> command without any 1.427 + options. By default, the <command role="hg-cmd">hg 1.428 + pull</command> command <emphasis>does not</emphasis> 1.429 + update the working directory, so you'll bring new changesets 1.430 + into your repository, but the working directory will stay 1.431 + synced at the same changeset as before the pull. If you 1.432 + make some changes and commit afterwards, you'll thus create 1.433 + a new head, because your working directory isn't synced to 1.434 + whatever the current tip is.</para> 1.435 + 1.436 + <para> I put the word <quote>error</quote> in quotes because 1.437 + all that you need to do to rectify this situation is 1.438 + <command role="hg-cmd">hg merge</command>, then <command 1.439 + role="hg-cmd">hg commit</command>. In other words, this 1.440 + almost never has negative consequences; it just surprises 1.441 + people. 
I'll discuss other ways to avoid this behaviour, 1.442 + and why Mercurial behaves in this initially surprising way, 1.443 + later on.</para> 1.444 + </note> 1.445 + 1.446 + </sect2> 1.447 + <sect2> 1.448 + <title>Merging heads</title> 1.449 + 1.450 + <para>When you run the <command role="hg-cmd">hg merge</command> 1.451 + command, Mercurial leaves the first parent of the working 1.452 + directory unchanged, and sets the second parent to the 1.453 + changeset you're merging with, as shown in figure <xref 1.454 + linkend="fig:concepts:wdir-merge"/>.</para> 1.455 + 1.456 + <informalfigure id="fig:concepts:wdir-merge"> 1.457 + <mediaobject><imageobject><imagedata 1.458 + fileref="wdir-merge"/></imageobject><textobject><phrase>XXX 1.459 + add text</phrase></textobject><caption><para>Merging two 1.460 + heads</para></caption></mediaobject> 1.461 + </informalfigure> 1.462 + 1.463 + <para>Mercurial also has to modify the working directory, to 1.464 + merge the files managed in the two changesets. 
Simplified a 1.465 + little, the merging process goes like this, for every file in 1.466 + the manifests of both changesets.</para> 1.467 + <itemizedlist> 1.468 + <listitem><para>If neither changeset has modified a file, do 1.469 + nothing with that file.</para> 1.470 + </listitem> 1.471 + <listitem><para>If one changeset has modified a file, and the 1.472 + other hasn't, create the modified copy of the file in the 1.473 + working directory.</para> 1.474 + </listitem> 1.475 + <listitem><para>If one changeset has removed a file, and the 1.476 + other hasn't (or has also deleted it), delete the file 1.477 + from the working directory.</para> 1.478 + </listitem> 1.479 + <listitem><para>If one changeset has removed a file, but the 1.480 + other has modified the file, ask the user what to do: keep 1.481 + the modified file, or remove it?</para> 1.482 + </listitem> 1.483 + <listitem><para>If both changesets have modified a file, 1.484 + invoke an external merge program to choose the new 1.485 + contents for the merged file. This may require input from 1.486 + the user.</para> 1.487 + </listitem> 1.488 + <listitem><para>If one changeset has modified a file, and the 1.489 + other has renamed or copied the file, make sure that the 1.490 + changes follow the new name of the file.</para> 1.491 + </listitem></itemizedlist> 1.492 + <para>There are more details&emdash;merging has plenty of corner 1.493 + cases&emdash;but these are the most common choices that are 1.494 + involved in a merge. As you can see, most cases are 1.495 + completely automatic, and indeed most merges finish 1.496 + automatically, without requiring your input to resolve any 1.497 + conflicts.</para> 1.498 + 1.499 + <para>When you're thinking about what happens when you commit 1.500 + after a merge, once again the working directory is <quote>the 1.501 + changeset I'm about to commit</quote>. 
After the <command 1.502 + role="hg-cmd">hg merge</command> command completes, the 1.503 + working directory has two parents; these will become the 1.504 + parents of the new changeset.</para> 1.505 + 1.506 + <para>Mercurial lets you perform multiple merges, but you must 1.507 + commit the results of each individual merge as you go. This 1.508 + is necessary because Mercurial only tracks two parents for 1.509 + both revisions and the working directory. While it would be 1.510 + technically possible to merge multiple changesets at once, the 1.511 + prospect of user confusion and making a terrible mess of a 1.512 + merge immediately becomes overwhelming.</para> 1.513 + 1.514 + </sect2> 1.515 + </sect1> 1.516 + <sect1> 1.517 + <title>Other interesting design features</title> 1.518 + 1.519 + <para>In the sections above, I've tried to highlight some of the 1.520 + most important aspects of Mercurial's design, to illustrate that 1.521 + it pays careful attention to reliability and performance. 1.522 + However, the attention to detail doesn't stop there. There are 1.523 + a number of other aspects of Mercurial's construction that I 1.524 + personally find interesting. I'll detail a few of them here, 1.525 + separate from the <quote>big ticket</quote> items above, so that 1.526 + if you're interested, you can gain a better idea of the amount 1.527 + of thinking that goes into a well-designed system.</para> 1.528 + 1.529 + <sect2> 1.530 + <title>Clever compression</title> 1.531 + 1.532 + <para>When appropriate, Mercurial will store both snapshots and 1.533 + deltas in compressed form. 
It does this by always 1.534 + <emphasis>trying to</emphasis> compress a snapshot or delta, 1.535 + but only storing the compressed version if it's smaller than 1.536 + the uncompressed version.</para> 1.537 + 1.538 + <para>This means that Mercurial does <quote>the right 1.539 + thing</quote> when storing a file whose native form is 1.540 + compressed, such as a <literal>zip</literal> archive or a JPEG 1.541 + image. When these types of files are compressed a second 1.542 + time, the resulting file is usually bigger than the 1.543 + once-compressed form, and so Mercurial will store the plain 1.544 + <literal>zip</literal> or JPEG.</para> 1.545 + 1.546 + <para>Deltas between revisions of a compressed file are usually 1.547 + larger than snapshots of the file, and Mercurial again does 1.548 + <quote>the right thing</quote> in these cases. It finds that 1.549 + such a delta exceeds the threshold at which it should store a 1.550 + complete snapshot of the file, so it stores the snapshot, 1.551 + again saving space compared to a naive delta-only 1.552 + approach.</para> 1.553 + 1.554 + <sect3> 1.555 + <title>Network recompression</title> 1.556 + 1.557 + <para>When storing revisions on disk, Mercurial uses the 1.558 + <quote>deflate</quote> compression algorithm (the same one 1.559 + used by the popular <literal>zip</literal> archive format), 1.560 + which balances good speed with a respectable compression 1.561 + ratio. However, when transmitting revision data over a 1.562 + network connection, Mercurial uncompresses the compressed 1.563 + revision data.</para> 1.564 + 1.565 + <para>If the connection is over HTTP, Mercurial recompresses 1.566 + the entire stream of data using a compression algorithm that 1.567 + gives a better compression ratio (the Burrows-Wheeler 1.568 + algorithm from the widely used <literal>bzip2</literal> 1.569 + compression package). 
This combination of algorithm and 1.570 + compression of the entire stream (instead of a revision at a 1.571 + time) substantially reduces the number of bytes to be 1.572 + transferred, yielding better network performance over almost 1.573 + all kinds of network.</para> 1.574 + 1.575 + <para>(If the connection is over <command>ssh</command>, 1.576 + Mercurial <emphasis>doesn't</emphasis> recompress the 1.577 + stream, because <command>ssh</command> can already do this 1.578 + itself.)</para> 1.579 + 1.580 + </sect3> 1.581 + </sect2> 1.582 + <sect2> 1.583 + <title>Read/write ordering and atomicity</title> 1.584 + 1.585 + <para>Appending to files isn't the whole story when it comes to 1.586 + guaranteeing that a reader won't see a partial write. If you 1.587 + recall figure <xref linkend="fig:concepts:metadata"/>, 1.588 + revisions in the 1.589 + changelog point to revisions in the manifest, and revisions in 1.590 + the manifest point to revisions in filelogs. This hierarchy 1.591 + is deliberate.</para> 1.592 + 1.593 + <para>A writer starts a transaction by writing filelog and 1.594 + manifest data, and doesn't write any changelog data until 1.595 + those are finished. A reader starts by reading changelog 1.596 + data, then manifest data, followed by filelog data.</para> 1.597 + 1.598 + <para>Since the writer has always finished writing filelog and 1.599 + manifest data before it writes to the changelog, a reader will 1.600 + never read a pointer to a partially written manifest revision 1.601 + from the changelog, and it will never read a pointer to a 1.602 + partially written filelog revision from the manifest.</para> 1.603 + 1.604 + </sect2> 1.605 + <sect2> 1.606 + <title>Concurrent access</title> 1.607 + 1.608 + <para>The read/write ordering and atomicity guarantees mean that 1.609 + Mercurial never needs to <emphasis>lock</emphasis> a 1.610 + repository when it's reading data, even if the repository is 1.611 + being written to while the read is occurring. 
This has a big 1.612 + effect on scalability; you can have an arbitrary number of 1.613 + Mercurial processes safely reading data from a repository 1.614 + safely all at once, no matter whether it's being written to or 1.615 + not.</para> 1.616 + 1.617 + <para>The lockless nature of reading means that if you're 1.618 + sharing a repository on a multi-user system, you don't need to 1.619 + grant other local users permission to 1.620 + <emphasis>write</emphasis> to your repository in order for 1.621 + them to be able to clone it or pull changes from it; they only 1.622 + need <emphasis>read</emphasis> permission. (This is 1.623 + <emphasis>not</emphasis> a common feature among revision 1.624 + control systems, so don't take it for granted! Most require 1.625 + readers to be able to lock a repository to access it safely, 1.626 + and this requires write permission on at least one directory, 1.627 + which of course makes for all kinds of nasty and annoying 1.628 + security and administrative problems.)</para> 1.629 + 1.630 + <para>Mercurial uses locks to ensure that only one process can 1.631 + write to a repository at a time (the locking mechanism is safe 1.632 + even over filesystems that are notoriously hostile to locking, 1.633 + such as NFS). If a repository is locked, a writer will wait 1.634 + for a while to retry if the repository becomes unlocked, but 1.635 + if the repository remains locked for too long, the process 1.636 + attempting to write will time out after a while. This means 1.637 + that your daily automated scripts won't get stuck forever and 1.638 + pile up if a system crashes unnoticed, for example. (Yes, the 1.639 + timeout is configurable, from zero to infinity.)</para> 1.640 + 1.641 + <sect3> 1.642 + <title>Safe dirstate access</title> 1.643 + 1.644 + <para>As with revision data, Mercurial doesn't take a lock to 1.645 + read the dirstate file; it does acquire a lock to write it. 
1.646 + To avoid the possibility of reading a partially written copy 1.647 + of the dirstate file, Mercurial writes to a file with a 1.648 + unique name in the same directory as the dirstate file, then 1.649 + renames the temporary file atomically to 1.650 + <filename>dirstate</filename>. The file named 1.651 + <filename>dirstate</filename> is thus guaranteed to be 1.652 + complete, not partially written.</para> 1.653 + 1.654 + </sect3> 1.655 + </sect2> 1.656 + <sect2> 1.657 + <title>Avoiding seeks</title> 1.658 + 1.659 + <para>Critical to Mercurial's performance is the avoidance of 1.660 + seeks of the disk head, since any seek is far more expensive 1.661 + than even a comparatively large read operation.</para> 1.662 + 1.663 + <para>This is why, for example, the dirstate is stored in a 1.664 + single file. If there were a dirstate file per directory that 1.665 + Mercurial tracked, the disk would seek once per directory. 1.666 + Instead, Mercurial reads the entire single dirstate file in 1.667 + one step.</para> 1.668 + 1.669 + <para>Mercurial also uses a <quote>copy on write</quote> scheme 1.670 + when cloning a repository on local storage. Instead of 1.671 + copying every revlog file from the old repository into the new 1.672 + repository, it makes a <quote>hard link</quote>, which is a 1.673 + shorthand way to say <quote>these two names point to the same 1.674 + file</quote>. When Mercurial is about to write to one of a 1.675 + revlog's files, it checks to see if the number of names 1.676 + pointing at the file is greater than one. If it is, more than 1.677 + one repository is using the file, so Mercurial makes a new 1.678 + copy of the file that is private to this repository.</para> 1.679 + 1.680 + <para>A few revision control developers have pointed out that 1.681 + this idea of making a complete private copy of a file is not 1.682 + very efficient in its use of storage. 
While this is true, 1.683 + storage is cheap, and this method gives the highest 1.684 + performance while deferring most book-keeping to the operating 1.685 + system. An alternative scheme would most likely reduce 1.686 + performance and increase the complexity of the software, and 1.687 + performance and simplicity each matter much more than storage efficiency to the <quote>feel</quote> of 1.688 + day-to-day use.</para> 1.689 + 1.690 + </sect2> 1.691 + <sect2> 1.692 + <title>Other contents of the dirstate</title> 1.693 + 1.694 + <para>Because Mercurial doesn't force you to tell it when you're 1.695 + modifying a file, it uses the dirstate to store some extra 1.696 + information so it can determine efficiently whether you have 1.697 + modified a file. For each file in the working directory, it 1.698 + stores the time at which Mercurial itself last modified the file, and the 1.699 + size of the file at that time.</para> 1.700 + 1.701 + <para>When you explicitly <command role="hg-cmd">hg 1.702 + add</command>, <command role="hg-cmd">hg remove</command>, 1.703 + <command role="hg-cmd">hg rename</command> or <command 1.704 + role="hg-cmd">hg copy</command> files, Mercurial updates the 1.705 + dirstate so that it knows what to do with those files when you 1.706 + commit.</para> 1.707 + 1.708 + <para>When Mercurial is checking the states of files in the 1.709 + working directory, it first compares a file's modification time with the time recorded in the dirstate. 1.710 + If the time has not changed, the file must not have been modified. 1.711 + If the file's size has changed, the file must have been 1.712 + modified. If the modification time has changed, but the size 1.713 + has not, only then does Mercurial need to read the actual 1.714 + contents of the file to see if they've changed.
Storing these 1.715 + few extra pieces of information dramatically reduces the 1.716 + amount of data that Mercurial needs to read, which yields 1.717 + large performance improvements compared to other revision 1.718 + control systems.</para> 1.719 + 1.720 + </sect2> 1.721 + </sect1> 1.722 +</chapter> 1.723 + 1.724 +<!-- 1.725 +local variables: 1.726 +sgml-parent-document: ("00book.xml" "book" "chapter") 1.727 +end: 1.728 +-->
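The three-step status check described under "Other contents of the dirstate" can be sketched in a few lines of Python. This is only an illustration of the decision order given in the text, not Mercurial's actual code or on-disk format: the names `DirstateEntry`, `record` and `is_modified` are hypothetical, and the sketch keeps a copy of the file's bytes in memory where real Mercurial would compare against the revision stored in the filelog.

```python
import os
from dataclasses import dataclass


@dataclass
class DirstateEntry:
    """Hypothetical per-file record, loosely modelled on the text above."""
    mtime_ns: int   # modification time noted when the file was last written
    size: int       # size of the file at that time
    content: bytes  # assumption: stand-in for the revision kept in the filelog


def record(path):
    """Note the current time stamp, size and contents of a file."""
    st = os.stat(path)
    with open(path, "rb") as f:
        return DirstateEntry(st.st_mtime_ns, st.st_size, f.read())


def is_modified(path, entry):
    """Apply the checks in the order the text describes."""
    st = os.stat(path)
    # 1. Unchanged modification time: treat the file as unmodified.
    if st.st_mtime_ns == entry.mtime_ns:
        return False
    # 2. Changed size: the file must have been modified.
    if st.st_size != entry.size:
        return True
    # 3. Time changed but size did not: only now read the contents.
    with open(path, "rb") as f:
        return f.read() != entry.content
```

Only the third branch opens the file, so a working directory whose time stamps are untouched costs one `os.stat` call per file and no reads, which is the saving the section describes.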