<!-- vim: set filetype=docbkxml shiftwidth=2 autoindent expandtab tw=77 : -->

<chapter id="chap:concepts">
  <?dbhtml filename="behind-the-scenes.html"?>
  <title>Behind the scenes</title>

  <para id="x_2e8">Unlike many revision control tools, the concepts
    on which Mercurial is built are simple enough that it is easy to
    understand how the software works. Although knowing them is not
    necessary, I find it useful to have a <quote>mental model</quote>
    of what is going on.</para>

  <para id="x_2e9">This understanding gives me confidence that
    Mercurial has been carefully designed to be both
    <emphasis>safe</emphasis> and <emphasis>efficient</emphasis>.
    Moreover, if it is easy for me to keep in mind what the software
    is doing when I perform a revision control task, I am less likely
    to be surprised by its behavior.</para>

  <para id="x_2ea">In this chapter, we first describe the core
    concepts behind Mercurial's design, then discuss some of the
    interesting details of its implementation.</para>

  <sect1>
    <title>Mercurial's historical record</title>
    <sect2>
      <title>Tracking the history of a single file</title>

      <para id="x_2eb">When Mercurial tracks modifications to a file,
        it stores the history of that file in a metadata object called
        a <emphasis>filelog</emphasis>. Each entry in the filelog
        contains enough information to reconstruct one revision of the
        file that is being tracked.
        Filelogs are stored as files in the
        <filename role="special"
          class="directory">.hg/store/data</filename> directory. A
        filelog contains two kinds of information: revision data, and
        an index that lets Mercurial find a given revision
        efficiently.</para>

      <para id="x_2ec">When a file is large or has a long history, its
        filelog is stored in separate data
        (<quote><literal>.d</literal></quote> suffix) and index
        (<quote><literal>.i</literal></quote> suffix) files. The
        correspondence between a file in the working directory and the
        filelog that tracks its history in the repository is
        illustrated in <xref linkend="fig:concepts:filelog"/>.</para>

      <figure id="fig:concepts:filelog">
        <title>Relationships between files in the working directory and
          their filelogs in the repository</title>
        <mediaobject>
          <imageobject><imagedata fileref="figs/filelog.png"/></imageobject>
          <textobject><phrase>XXX add text</phrase></textobject>
        </mediaobject>
      </figure>

    </sect2>
    <sect2>
      <title>Managing tracked files</title>

      <para id="x_2ee">Mercurial uses a structure called a
        <emphasis>manifest</emphasis> to collect information about the
        files that it tracks. Each entry in the manifest contains
        information about the files present in a given revision.
        An entry records the list of files that are part of the
        revision, the version of each file, and a few other pieces of
        metadata about those files.</para>

    </sect2>
    <sect2>
      <title>Recording changeset information</title>

      <para id="x_2ef">The <emphasis>changelog</emphasis> contains
        information about each changeset. Each revision records who
        committed a change, the changeset comment, other pieces of
        changeset-related information, and the revision of the manifest
        to use.</para>

    </sect2>
    <sect2>
      <title>Relationships between revisions</title>

      <para id="x_2f0">Within a changelog, a manifest, or a filelog, each
        revision stores a pointer to its immediate parent (or to its
        two parents, if it's a merge revision). As I mentioned above,
        there are also relationships between revisions
        <emphasis>across</emphasis> these structures, and they are
        hierarchical in nature.</para>

      <para id="x_2f1">For every changeset in a repository, there is
        exactly one revision stored in the changelog. Each revision of
        the changelog contains a pointer to a single revision of the
        manifest. A revision of the manifest stores a pointer to a
        single revision of each filelog tracked when that changeset
        was created.
        These relationships are illustrated in
        <xref linkend="fig:concepts:metadata"/>.</para>

      <figure id="fig:concepts:metadata">
        <title>Metadata relationships</title>
        <mediaobject>
          <imageobject><imagedata fileref="figs/metadata.png"/></imageobject>
          <textobject><phrase>XXX add text</phrase></textobject>
        </mediaobject>
      </figure>

      <para id="x_2f3">As the illustration shows, there is
        <emphasis>not</emphasis> a <quote>one to one</quote>
        relationship between revisions in the changelog, manifest, or
        filelog. If a file that Mercurial tracks hasn't changed between
        two changesets, the entry for that file in the two revisions of
        the manifest will point to the same revision of its
        filelog<footnote>
          <para id="x_725">It is possible (though unusual) for the
            manifest to remain the same between two changesets, in
            which case the changelog entries for those changesets will
            point to the same revision of the manifest.</para>
        </footnote>.</para>

    </sect2>
  </sect1>
  <sect1>
    <title>Safe, efficient storage</title>

    <para id="x_2f4">The underpinnings of changelogs, manifests, and
      filelogs are provided by a single structure called the
      <emphasis>revlog</emphasis>.</para>

    <sect2>
      <title>Efficient storage</title>

      <para id="x_2f5">The revlog provides efficient storage of
        revisions using a <emphasis>delta</emphasis> mechanism. Instead
        of storing a complete copy of a file for each revision, it
        stores the changes needed to transform an older revision into
        the new revision.
        For many kinds of file data, these deltas are typically a
        fraction of a percent of the size of a full copy of a
        file.</para>

      <para id="x_2f6">Some obsolete revision control systems can only
        work with deltas of text files. They must either store binary
        files as complete snapshots or encode them into a text
        representation, both of which are wasteful approaches.
        Mercurial can efficiently handle deltas of files with arbitrary
        binary contents; it doesn't need to treat text as
        special.</para>

    </sect2>
    <sect2 id="sec:concepts:txn">
      <title>Safe operation</title>

      <para id="x_2f7">Mercurial only ever <emphasis>appends</emphasis>
        data to the end of a revlog file. It never modifies a section
        of a file after it has written it. This is both more robust
        and more efficient than schemes that need to modify or rewrite
        data.</para>

      <para id="x_2f8">In addition, Mercurial treats every write as
        part of a <emphasis>transaction</emphasis> that can span a
        number of files. A transaction is <emphasis>atomic</emphasis>:
        either the entire transaction succeeds and its effects are all
        visible to readers in one go, or the whole thing is undone.
        This guarantee of atomicity means that if you're running two
        copies of Mercurial, where one is reading data and one is
        writing it, the reader will never see a partially written
        result that might confuse it.</para>

      <para id="x_2f9">The fact that Mercurial only appends to files
        makes it easier to provide this transactional guarantee.
        The easier it is to provide such a guarantee, the more
        confident you can be that it is implemented correctly.</para>

    </sect2>
    <sect2>
      <title>Fast retrieval</title>

      <para id="x_2fa">Mercurial cleverly avoids a pitfall common to
        all earlier revision control systems: the problem of
        <emphasis>inefficient retrieval</emphasis>. Most revision
        control systems store the contents of a revision as an
        incremental series of modifications against a
        <quote>snapshot</quote>. (Some base the snapshot on the
        oldest revision, others on the newest.) To reconstruct a
        specific revision, you must first read the snapshot, and then
        every one of the revisions between the snapshot and your
        target revision. The more history a file accumulates, the
        more revisions you must read, and hence the longer it takes to
        reconstruct a particular revision.</para>

      <figure id="fig:concepts:snapshot">
        <title>Snapshot of a revlog, with incremental deltas</title>
        <mediaobject>
          <imageobject><imagedata fileref="figs/snapshot.png"/></imageobject>
          <textobject><phrase>XXX add text</phrase></textobject>
        </mediaobject>
      </figure>

      <para id="x_2fc">The innovation that Mercurial applies to this
        problem is simple but effective. Once the cumulative amount of
        delta information stored since the last snapshot exceeds a
        fixed threshold, it stores a new snapshot (compressed, of
        course), instead of another delta. This makes it possible to
        reconstruct <emphasis>any</emphasis> revision of a file
        quickly. This approach works so well that it has since been
        copied by several other revision control systems.</para>

      <para id="x_2fd"><xref linkend="fig:concepts:snapshot"/> illustrates
        the idea.
        In an entry in a revlog's index file, Mercurial stores the
        range of entries from the data file that it must read to
        reconstruct a particular revision.</para>

      <sect3>
        <title>Aside: the influence of video compression</title>

        <para id="x_2fe">If you're familiar with video compression or
          have ever watched a TV feed through a digital cable or
          satellite service, you may know that most video compression
          schemes store each frame of video as a delta against its
          predecessor frame, with a complete <quote>key frame</quote>
          inserted periodically.</para>

        <para id="x_2ff">Mercurial borrows this idea to make it
          possible to reconstruct a revision from a snapshot and a
          small number of deltas.</para>

      </sect3>
    </sect2>
    <sect2>
      <title>Identification and strong integrity</title>

      <para id="x_300">Along with delta or snapshot information, a
        revlog entry contains a cryptographic hash of the data that it
        represents. This makes it difficult to forge the contents of a
        revision, and easy to detect accidental corruption.</para>

      <para id="x_301">Hashes provide more than a mere check against
        corruption; they are used as the identifiers for revisions.
        The changeset identification hashes that you see as an end
        user are from revisions of the changelog. Although filelogs
        and the manifest also use hashes, Mercurial only uses these
        behind the scenes.</para>

      <para id="x_302">Mercurial verifies that hashes are correct when
        it retrieves file revisions and when it pulls changes from
        another repository.
        If it encounters an integrity problem, it will complain and
        stop whatever it's doing.</para>

      <para id="x_303">In addition to the effect it has on retrieval
        efficiency, Mercurial's use of periodic snapshots makes it
        more robust against partial data corruption. If a revlog
        becomes partly corrupted due to a hardware error or system
        bug, it's often possible to reconstruct some or most revisions
        from the uncorrupted sections of the revlog, both before and
        after the corrupted section. This would not be possible with a
        delta-only storage model.</para>
    </sect2>
  </sect1>

  <sect1>
    <title>Revision history, branching, and merging</title>

    <para id="x_304">Every entry in a Mercurial revlog knows the
      identity of its immediate ancestor revision, usually referred to
      as its <emphasis>parent</emphasis>. In fact, a revision contains
      room for not one parent, but two. Mercurial uses a special hash,
      called the <quote>null ID</quote>, to represent the idea
      <quote>there is no parent here</quote>. This hash is simply a
      string of zeroes.</para>

    <para id="x_305">In <xref linkend="fig:concepts:revlog"/>, you can
      see an example of the conceptual structure of a revlog. Filelogs,
      manifests, and changelogs all have this same structure; they
      differ only in the kind of data stored in each delta or
      snapshot.</para>

    <para id="x_306">The first revision in a revlog (at the bottom of
      the image) has the null ID in both of its parent slots. For a
      <quote>normal</quote> revision, its first parent slot contains
      the ID of its parent revision, and its second contains the null
      ID, indicating that the revision has only one real parent. Any
      two revisions that have the same parent ID are branches.
      A revision that represents a merge between branches has two
      normal revision IDs in its parent slots.</para>

    <figure id="fig:concepts:revlog">
      <title>The conceptual structure of a revlog</title>
      <mediaobject>
        <imageobject><imagedata fileref="figs/revlog.png"/></imageobject>
        <textobject><phrase>XXX add text</phrase></textobject>
      </mediaobject>
    </figure>

  </sect1>
  <sect1>
    <title>The working directory</title>

    <para id="x_307">In the working directory, Mercurial stores a
      snapshot of the files from the repository as of a particular
      changeset.</para>

    <para id="x_308">The working directory <quote>knows</quote> which
      changeset it contains. When you update the working directory to
      contain a particular changeset, Mercurial looks up the
      appropriate revision of the manifest to find out which files it
      was tracking at the time that changeset was committed, and which
      revision of each file was then current. It then recreates a copy
      of each of those files, with the same contents it had when the
      changeset was committed.</para>

    <para id="x_309">The <emphasis>dirstate</emphasis> is a special
      structure that contains Mercurial's knowledge of the working
      directory. It is maintained as a file named
      <filename>.hg/dirstate</filename> inside a repository. The
      dirstate details which changeset the working directory is
      updated to, and all of the files that Mercurial is tracking in
      the working directory.
      It also lets Mercurial quickly notice changed files, by
      recording their checkout times and sizes.</para>

    <para id="x_30a">Just as a revision of a revlog has room for two
      parents, so that it can represent either a normal revision (with
      one parent) or a merge of two earlier revisions, the dirstate has
      slots for two parents. When you use the
      <command role="hg-cmd">hg update</command> command, the changeset
      that you update to is stored in the <quote>first parent</quote>
      slot, and the null ID in the second. When you
      <command role="hg-cmd">hg merge</command> with another changeset,
      the first parent remains unchanged, and the second parent is
      filled in with the changeset you're merging with. The
      <command role="hg-cmd">hg parents</command> command tells you
      what the parents of the dirstate are.</para>

    <sect2>
      <title>What happens when you commit</title>

      <para id="x_30b">The dirstate stores parent information for more
        than just book-keeping purposes. Mercurial uses the parents of
        the dirstate as <emphasis>the parents of a new
          changeset</emphasis> when you perform a commit.</para>

      <figure id="fig:concepts:wdir">
        <title>The working directory can have two parents</title>
        <mediaobject>
          <imageobject><imagedata fileref="figs/wdir.png"/></imageobject>
          <textobject><phrase>XXX add text</phrase></textobject>
        </mediaobject>
      </figure>

      <para id="x_30d"><xref linkend="fig:concepts:wdir"/> shows the
        normal state of the working directory, where it has a single
        changeset as parent.
        That changeset is the <emphasis>tip</emphasis>, the newest
        changeset in the repository that has no children.</para>

      <figure id="fig:concepts:wdir-after-commit">
        <title>The working directory gains new parents after a
          commit</title>
        <mediaobject>
          <imageobject><imagedata fileref="figs/wdir-after-commit.png"/></imageobject>
          <textobject><phrase>XXX add text</phrase></textobject>
        </mediaobject>
      </figure>

      <para id="x_30f">It's useful to think of the working directory as
        <quote>the changeset I'm about to commit</quote>. Any files
        that you tell Mercurial that you've added, removed, renamed,
        or copied will be reflected in that changeset, as will
        modifications to any files that Mercurial is already tracking;
        the new changeset will have the parents of the working
        directory as its parents.</para>

      <para id="x_310">After a commit, Mercurial will update the
        parents of the working directory, so that the first parent is
        the ID of the new changeset, and the second is the null ID.
        This is shown in <xref
          linkend="fig:concepts:wdir-after-commit"/>. Mercurial
        doesn't touch any of the files in the working directory when
        you commit; it just modifies the dirstate to note its new
        parents.</para>

    </sect2>
    <sect2>
      <title>Creating a new head</title>

      <para id="x_311">It's perfectly normal to update the working
        directory to a changeset other than the current tip. For
        example, you might want to know what your project looked like
        last Tuesday, or you could be looking through changesets to
        see which one introduced a bug.
        In cases like this, the natural thing to do is update the
        working directory to the changeset you're interested in, and
        then examine the files in the working directory directly to
        see their contents as they were when you committed that
        changeset. The effect of this is shown in
        <xref linkend="fig:concepts:wdir-pre-branch"/>.</para>

      <figure id="fig:concepts:wdir-pre-branch">
        <title>The working directory, updated to an older
          changeset</title>
        <mediaobject>
          <imageobject><imagedata fileref="figs/wdir-pre-branch.png"/></imageobject>
          <textobject><phrase>XXX add text</phrase></textobject>
        </mediaobject>
      </figure>

      <para id="x_313">Having updated the working directory to an
        older changeset, what happens if you make some changes, and
        then commit? Mercurial behaves in the same way as I outlined
        above. The parents of the working directory become the
        parents of the new changeset. This new changeset has no
        children, so it becomes the new tip. And the repository now
        contains two changesets that have no children; we call these
        <emphasis>heads</emphasis>. You can see the structure that
        this creates in <xref
          linkend="fig:concepts:wdir-branch"/>.</para>

      <figure id="fig:concepts:wdir-branch">
        <title>After a commit made while synced to an older
          changeset</title>
        <mediaobject>
          <imageobject><imagedata fileref="figs/wdir-branch.png"/></imageobject>
          <textobject><phrase>XXX add text</phrase></textobject>
        </mediaobject>
      </figure>

      <note>
        <para id="x_315">If you're new to Mercurial, you should keep
          in mind a common <quote>error</quote>, which is to use the
          <command role="hg-cmd">hg pull</command> command without any
          options.
          By default, the <command role="hg-cmd">hg pull</command>
          command <emphasis>does not</emphasis> update the working
          directory, so you'll bring new changesets into your
          repository, but the working directory will stay synced at
          the same changeset as before the pull. If you make some
          changes and commit afterwards, you'll thus create a new
          head, because your working directory isn't synced to
          whatever the current tip is. To combine the operation of a
          pull, followed by an update, run <command>hg pull
            -u</command>.</para>

        <para id="x_316">I put the word <quote>error</quote> in quotes
          because all that you need to do to rectify the situation
          where you created a new head by accident is
          <command role="hg-cmd">hg merge</command>, then
          <command role="hg-cmd">hg commit</command>. In other words,
          this almost never has negative consequences; it's just
          something of a surprise for newcomers.
          I'll discuss other ways to avoid this behavior, and why
          Mercurial behaves in this initially surprising way, later
          on.</para>
      </note>

    </sect2>
    <sect2>
      <title>Merging changes</title>

      <para id="x_317">When you run the <command role="hg-cmd">hg
          merge</command> command, Mercurial leaves the first parent
        of the working directory unchanged, and sets the second parent
        to the changeset you're merging with, as shown in <xref
          linkend="fig:concepts:wdir-merge"/>.</para>

      <figure id="fig:concepts:wdir-merge">
        <title>Merging two heads</title>
        <mediaobject>
          <imageobject>
            <imagedata fileref="figs/wdir-merge.png"/>
          </imageobject>
          <textobject><phrase>XXX add text</phrase></textobject>
        </mediaobject>
      </figure>

      <para id="x_319">Mercurial also has to modify the working
        directory, to merge the files managed in the two changesets.
        Simplified a little, the merging process goes like this, for
        every file in the manifests of both changesets:</para>
      <itemizedlist>
        <listitem><para id="x_31a">If neither changeset has modified a
            file, do nothing with that file.</para>
        </listitem>
        <listitem><para id="x_31b">If one changeset has modified a
            file, and the other hasn't, create the modified copy of
            the file in the working directory.</para>
        </listitem>
        <listitem><para id="x_31c">If one changeset has removed a
            file, and the other hasn't (or has also deleted it),
            delete the file from the working directory.</para>
        </listitem>
        <listitem><para id="x_31d">If one changeset has removed a
            file, but the other has modified the file, ask the user
            what to do: keep the modified file, or remove it?</para>
        </listitem>
        <listitem><para id="x_31e">If both changesets have modified a
            file, invoke an external merge program to choose the new
            contents for the merged file. This may require input from
            the user.</para>
        </listitem>
        <listitem><para id="x_31f">If one changeset has modified a
            file, and the other has renamed or copied the file, make
            sure that the changes follow the new name of the
            file.</para>
        </listitem></itemizedlist>
      <para id="x_320">There are more details (merging has plenty of
        corner cases), but these are the most common choices that are
        involved in a merge.
        As you can see, most cases are completely automatic, and
        indeed most merges finish automatically, without requiring
        your input to resolve any conflicts.</para>

      <para id="x_321">When you're thinking about what happens when you
        commit after a merge, once again the working directory is
        <quote>the changeset I'm about to commit</quote>. After the
        <command role="hg-cmd">hg merge</command> command completes,
        the working directory has two parents; these will become the
        parents of the new changeset.</para>

      <para id="x_322">Mercurial lets you perform multiple merges, but
        you must commit the results of each individual merge as you
        go. This is necessary because Mercurial only tracks two
        parents for both revisions and the working directory. While
        it would be technically feasible to merge multiple changesets
        at once, Mercurial avoids this for simplicity. With multi-way
        merges, the risks of user confusion, nasty conflict
        resolution, and making a terrible mess of a merge would grow
        intolerable.</para>

    </sect2>

    <sect2>
      <title>Merging and renames</title>

      <para id="x_69a">A surprising number of revision control systems
        pay little or no attention to a file's
        <emphasis>name</emphasis> over time. For instance, it used to
        be common that if a file got renamed on one side of a merge,
        the changes from the other side would be silently
        dropped.</para>

      <para id="x_69b">Mercurial records metadata when you tell it to
        perform a rename or copy. It uses this metadata to do the
        right thing during a merge.
        For instance, if I rename a file, and you edit it without
        renaming it, when we merge our work the file will be renamed
        and have your edits applied.</para>
    </sect2>
  </sect1>

  <sect1>
    <title>Other interesting design features</title>

    <para id="x_323">In the sections above, I've tried to highlight
      some of the most important aspects of Mercurial's design, to
      illustrate that it pays careful attention to reliability and
      performance. However, the attention to detail doesn't stop
      there. There are a number of other aspects of Mercurial's
      construction that I personally find interesting. I'll detail a
      few of them here, separate from the <quote>big ticket</quote>
      items above, so that if you're interested, you can gain a better
      idea of the amount of thinking that goes into a well-designed
      system.</para>

    <sect2>
      <title>Clever compression</title>

      <para id="x_324">When appropriate, Mercurial will store both
        snapshots and deltas in compressed form. It does this by
        always <emphasis>trying to</emphasis> compress a snapshot or
        delta, but only storing the compressed version if it's smaller
        than the uncompressed version.</para>

      <para id="x_325">This means that Mercurial does <quote>the right
          thing</quote> when storing a file whose native form is
        compressed, such as a <literal>zip</literal> archive or a JPEG
        image.
        When these types of files are compressed a second time, the
        resulting file is usually bigger than the once-compressed
        form, and so Mercurial will store the plain
        <literal>zip</literal> or JPEG.</para>

      <para id="x_326">Deltas between revisions of a compressed file
        are usually larger than snapshots of the file, and Mercurial
        again does <quote>the right thing</quote> in these cases. It
        finds that such a delta exceeds the threshold at which it
        should store a complete snapshot of the file, so it stores the
        snapshot, again saving space compared to a naive delta-only
        approach.</para>

      <sect3>
        <title>Network recompression</title>

        <para id="x_327">When storing revisions on disk, Mercurial uses
          the <quote>deflate</quote> compression algorithm (the same
          one used by the popular <literal>zip</literal> archive
          format), which balances good speed with a respectable
          compression ratio. However, when transmitting revision data
          over a network connection, Mercurial uncompresses the
          compressed revision data.</para>

        <para id="x_328">If the connection is over HTTP, Mercurial
          recompresses the entire stream of data using a compression
          algorithm that gives a better compression ratio (the
          Burrows-Wheeler algorithm from the widely used
          <literal>bzip2</literal> compression package).
          This combination of algorithm and compression of the entire
          stream (instead of a revision at a time) substantially
          reduces the number of bytes to be transferred, yielding
          better network performance over most kinds of
          network.</para>

        <para id="x_329">If the connection is over
          <command>ssh</command>, Mercurial
          <emphasis>doesn't</emphasis> recompress the stream, because
          <command>ssh</command> can already do this itself. You can
          tell Mercurial to always use <command>ssh</command>'s
          compression feature by editing the
          <filename>.hgrc</filename> file in your home directory as
          follows.</para>

        <programlisting>[ui]
ssh = ssh -C</programlisting>

      </sect3>
    </sect2>
    <sect2>
      <title>Read/write ordering and atomicity</title>

      <para id="x_32a">Appending to files isn't the whole story when
        it comes to guaranteeing that a reader won't see a partial
        write. If you recall <xref linkend="fig:concepts:metadata"/>,
        revisions in the changelog point to revisions in the manifest,
        and revisions in the manifest point to revisions in filelogs.
        This hierarchy is deliberate.</para>

      <para id="x_32b">A writer starts a transaction by writing filelog
        and manifest data, and doesn't write any changelog data until
        those are finished.
        A reader starts by reading changelog data, then manifest data,
        followed by filelog data.</para>

      <para id="x_32c">Since the writer has always finished writing
        filelog and manifest data before it writes to the changelog, a
        reader will never read a pointer to a partially written
        manifest revision from the changelog, and it will never read a
        pointer to a partially written filelog revision from the
        manifest.</para>

    </sect2>
    <sect2>
      <title>Concurrent access</title>

      <para id="x_32d">The read/write ordering and atomicity guarantees
        mean that Mercurial never needs to <emphasis>lock</emphasis> a
        repository when it's reading data, even if the repository is
        being written to while the read is occurring. This has a big
        effect on scalability; you can have an arbitrary number of
        Mercurial processes safely reading data from a repository
        all at once, no matter whether it's being written to or
        not.</para>

      <para id="x_32e">The lockless nature of reading means that if
        you're sharing a repository on a multi-user system, you don't
        need to grant other local users permission to
        <emphasis>write</emphasis> to your repository in order for
        them to be able to clone it or pull changes from it; they only
        need <emphasis>read</emphasis> permission. (This is
        <emphasis>not</emphasis> a common feature among revision
        control systems, so don't take it for granted!
Most require bos@559: readers to be able to lock a repository to access it safely, bos@559: and this requires write permission on at least one directory, bos@559: which of course makes for all kinds of nasty and annoying bos@559: security and administrative problems.)</para> bos@559: bos@584: <para id="x_32f">Mercurial uses locks to ensure that only one process can bos@559: write to a repository at a time (the locking mechanism is safe bos@559: even over filesystems that are notoriously hostile to locking, bos@559: such as NFS). If a repository is locked, a writer will wait bos@559: and retry for a while in case the repository becomes unlocked, but bos@559: if the repository remains locked for too long, the process bos@559: attempting to write will time out. This means bos@559: that your daily automated scripts won't get stuck forever and bos@559: pile up if a system crashes unnoticed, for example. (Yes, the bos@559: timeout is configurable, from zero to infinity.)</para> bos@559: bos@559: <sect3> bos@559: <title>Safe dirstate access</title> bos@559: bos@584: <para id="x_330">As with revision data, Mercurial doesn't take a lock to bos@559: read the dirstate file; it does acquire a lock to write it. bos@559: To avoid the possibility of reading a partially written copy bos@559: of the dirstate file, Mercurial writes to a file with a bos@559: unique name in the same directory as the dirstate file, then bos@559: renames the temporary file atomically to bos@559: <filename>dirstate</filename>. 
The file named bos@559: <filename>dirstate</filename> is thus guaranteed to be bos@559: complete, not partially written.</para> bos@559: bos@559: </sect3> bos@559: </sect2> bos@559: <sect2> bos@559: <title>Avoiding seeks</title> bos@559: bos@584: <para id="x_331">Critical to Mercurial's performance is the avoidance of bos@559: seeks of the disk head, since any seek is far more expensive bos@559: than even a comparatively large read operation.</para> bos@559: bos@584: <para id="x_332">This is why, for example, the dirstate is stored in a bos@559: single file. If there were a dirstate file per directory that bos@559: Mercurial tracked, the disk would seek once per directory. bos@559: Instead, Mercurial reads the entire single dirstate file in bos@559: one step.</para> bos@559: bos@584: <para id="x_333">Mercurial also uses a <quote>copy on write</quote> scheme bos@559: when cloning a repository on local storage. Instead of bos@559: copying every revlog file from the old repository into the new bos@559: repository, it makes a <quote>hard link</quote>, which is a bos@559: shorthand way to say <quote>these two names point to the same bos@559: file</quote>. When Mercurial is about to write to one of a bos@559: revlog's files, it checks to see if the number of names bos@559: pointing at the file is greater than one. If it is, more than bos@559: one repository is using the file, so Mercurial makes a new bos@559: copy of the file that is private to this repository.</para> bos@559: bos@584: <para id="x_334">A few revision control developers have pointed out that bos@559: this idea of making a complete private copy of a file is not bos@559: very efficient in its use of storage. While this is true, bos@559: storage is cheap, and this method gives the highest bos@559: performance while deferring most book-keeping to the operating bos@559: system. 
An alternative scheme would most likely reduce bos@701: performance and increase the complexity of the software, and both bos@701: speed and simplicity are key to the <quote>feel</quote> of bos@559: day-to-day use.</para> bos@559: bos@559: </sect2> bos@559: <sect2> bos@559: <title>Other contents of the dirstate</title> bos@559: bos@584: <para id="x_335">Because Mercurial doesn't force you to tell it when you're bos@559: modifying a file, it uses the dirstate to store some extra bos@559: information so it can determine efficiently whether you have bos@559: modified a file. For each file in the working directory, it bos@559: stores the time at which Mercurial itself last modified the file, and the bos@559: size of the file at that time.</para> bos@559: bos@584: <para id="x_336">When you explicitly <command role="hg-cmd">hg bos@559: add</command>, <command role="hg-cmd">hg remove</command>, bos@559: <command role="hg-cmd">hg rename</command> or <command bos@559: role="hg-cmd">hg copy</command> files, Mercurial updates the bos@559: dirstate so that it knows what to do with those files when you bos@559: commit.</para> bos@559: bos@701: <para id="x_337">The dirstate helps Mercurial to efficiently bos@701: check the status of files in a repository.</para> bos@701: bos@701: <itemizedlist> bos@701: <listitem> bos@702: <para id="x_726">When Mercurial checks the state of a file in the bos@701: working directory, it first checks the file's modification bos@701: time against the time in the dirstate that records when bos@701: Mercurial last wrote the file. If the last modified time bos@701: is the same as the time when Mercurial wrote the file, bos@701: Mercurial concludes that the file has not been modified and does not bos@701: need to check any further.</para> bos@701: </listitem> bos@701: <listitem> bos@702: <para id="x_727">If the file's size has changed, the file must have bos@701: been modified. 
If the modification time has changed, but bos@701: the size has not, only then does Mercurial need to bos@701: actually read the contents of the file to see if it has bos@701: changed.</para> bos@701: </listitem> bos@701: </itemizedlist> bos@701: bos@702: <para id="x_728">Storing the modification time and size dramatically bos@701: reduces the number of read operations that Mercurial needs to bos@701: perform when you run commands like <command>hg status</command>. bos@701: This results in large performance improvements.</para> bos@559: </sect2> bos@559: </sect1> belaran@964: </chapter> belaran@964: belaran@964: <!-- belaran@964: local variables: belaran@964: sgml-parent-document: ("00book.xml" "book" "chapter") belaran@964: end: bos@559: -->