belaran@964: <!-- vim: set filetype=docbkxml shiftwidth=2 autoindent expandtab tw=77 : -->
belaran@964: 
belaran@964: <chapter id="chap:concepts">
belaran@964: <title>Behind the scenes</title>
belaran@964: 
belaran@964: <para>Unlike many revision control systems, the concepts upon which
belaran@964: Mercurial is built are simple enough that it's easy to understand how
belaran@964: the software really works.  Knowing this certainly isn't necessary,
belaran@964: but I find it useful to have a <quote>mental model</quote> of what's going on.</para>
belaran@964: 
belaran@964: <para>This understanding gives me confidence that Mercurial has been
belaran@964: carefully designed to be both <emphasis>safe</emphasis> and <emphasis>efficient</emphasis>.  And
belaran@964: just as importantly, if it's easy for me to retain a good idea of what
belaran@964: the software is doing when I perform a revision control task, I'm less
belaran@964: likely to be surprised by its behaviour.</para>
belaran@964: 
belaran@964: <para>In this chapter, we'll initially cover the core concepts behind
belaran@964: Mercurial's design, then continue to discuss some of the interesting
belaran@964: details of its implementation.</para>
belaran@964: 
belaran@964: <sect1>
belaran@964: <title>Mercurial's historical record</title>
belaran@964: 
belaran@964: <sect2>
belaran@964: <title>Tracking the history of a single file</title>
belaran@964: 
belaran@964: <para>When Mercurial tracks modifications to a file, it stores the history
belaran@964: of that file in a metadata object called a <emphasis>filelog</emphasis>.  Each entry
belaran@964: in the filelog contains enough information to reconstruct one revision
belaran@964: of the file that is being tracked.  Filelogs are stored as files in
belaran@964: the <filename role="special" class="directory">.hg/store/data</filename> directory.  A filelog contains two kinds
belaran@964: of information: revision data, and an index to help Mercurial to find
belaran@964: a revision efficiently.</para>
belaran@964: 
belaran@964: <para>A file that is large, or has a lot of history, has its filelog stored
belaran@964: in separate data (<quote><literal>.d</literal></quote> suffix) and index (<quote><literal>.i</literal></quote>
belaran@964: suffix) files.  For small files without much history, the revision
belaran@964: data and index are combined in a single <quote><literal>.i</literal></quote> file.  The
belaran@964: correspondence between a file in the working directory and the filelog
belaran@964: that tracks its history in the repository is illustrated in
belaran@964: figure <xref linkend="fig:concepts:filelog"/>.</para>
belaran@964: 
belaran@964: <informalfigure id="fig:concepts:filelog">
belaran@964: 
belaran@964: <para>  <mediaobject><imageobject><imagedata fileref="filelog"/></imageobject><textobject><phrase>XXX add text</phrase></textobject></mediaobject>
belaran@964:   <caption><para>Relationships between files in working directory and
belaran@964:     filelogs in repository</para></caption>
belaran@964: </para>
belaran@964: </informalfigure>
belaran@964: 
belaran@964: </sect2>
belaran@964: <sect2>
belaran@964: <title>Managing tracked files</title>
belaran@964: 
belaran@964: <para>Mercurial uses a structure called a <emphasis>manifest</emphasis> to collect
belaran@964: together information about the files that it tracks.  Each entry in
belaran@964: the manifest contains information about the files present in a single
belaran@964: changeset.  An entry records which files are present in the changeset,
belaran@964: the revision of each file, and a few other pieces of file metadata.</para>
belaran@964: 
belaran@964: </sect2>
belaran@964: <sect2>
belaran@964: <title>Recording changeset information</title>
belaran@964: 
belaran@964: <para>The <emphasis>changelog</emphasis> contains information about each changeset.  Each
belaran@964: revision records who committed a change, the changeset comment, other
belaran@964: pieces of changeset-related information, and the revision of the
belaran@964: manifest to use.
belaran@964: </para>
belaran@964: 
belaran@964: </sect2>
belaran@964: <sect2>
belaran@964: <title>Relationships between revisions</title>
belaran@964: 
belaran@964: <para>Within a changelog, a manifest, or a filelog, each revision stores a
belaran@964: pointer to its immediate parent (or to its two parents, if it's a
belaran@964: merge revision).  As I mentioned above, there are also relationships
belaran@964: between revisions <emphasis>across</emphasis> these structures, and they are
belaran@964: hierarchical in nature.
belaran@964: </para>
belaran@964: 
belaran@964: <para>For every changeset in a repository, there is exactly one revision
belaran@964: stored in the changelog.  Each revision of the changelog contains a
belaran@964: pointer to a single revision of the manifest.  A revision of the
belaran@964: manifest stores a pointer to a single revision of each filelog tracked
belaran@964: when that changeset was created.  These relationships are illustrated
belaran@964: in figure <xref linkend="fig:concepts:metadata"/>.
belaran@964: </para>
belaran@964: 
belaran@964: <informalfigure id="fig:concepts:metadata">
belaran@964: 
belaran@964: <para>  <mediaobject><imageobject><imagedata fileref="metadata"/></imageobject><textobject><phrase>XXX add text</phrase></textobject></mediaobject>
belaran@964:   <caption><para>Metadata relationships</para></caption>
belaran@964: </para>
belaran@964: </informalfigure>
belaran@964: 
belaran@964: <para>As the illustration shows, there is <emphasis>not</emphasis> a <quote>one to one</quote>
belaran@964: relationship between revisions in the changelog, manifest, or filelog.
belaran@964: If the manifest hasn't changed between two changesets, the changelog
belaran@964: entries for those changesets will point to the same revision of the
belaran@964: manifest.  If a file that Mercurial tracks hasn't changed between two
belaran@964: changesets, the entry for that file in the two revisions of the
belaran@964: manifest will point to the same revision of its filelog.
belaran@964: </para>
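<para>To make this chain of pointers concrete, here is a minimal Python
sketch that models the three structures as plain lists and dictionaries.
The file names, users, and field names are hypothetical, and real revlogs
are files on disk rather than Python objects; the sketch only shows how a
changelog revision leads to a manifest revision, and from there to filelog
revisions that unchanged files can share.</para>

<programlisting language="python"><![CDATA[
```python
# Hypothetical in-memory model; names and layout are illustrative only.
filelogs = {
    "README": ["v0 of README", "v1 of README"],
    "main.c": ["v0 of main.c"],
}

# Each manifest revision maps a file name to a revision of that file's filelog.
manifest = [
    {"README": 0, "main.c": 0},  # manifest revision 0
    {"README": 1, "main.c": 0},  # manifest revision 1: only README changed
]

# Each changelog revision points at exactly one manifest revision.
changelog = [
    {"user": "alice", "desc": "initial import", "manifest": 0},
    {"user": "bob", "desc": "update README", "manifest": 1},
]


def file_at(changeset, name):
    """Follow changelog -> manifest -> filelog to fetch a file's contents."""
    manifest_rev = changelog[changeset]["manifest"]
    file_rev = manifest[manifest_rev][name]
    return filelogs[name][file_rev]
```
]]></programlisting>

<para>Looking up <literal>main.c</literal> in either changeset lands on the
same filelog revision, because that file did not change between the two
commits.</para>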
belaran@964: 
belaran@964: </sect2>
belaran@964: </sect1>
belaran@964: <sect1>
belaran@964: <title>Safe, efficient storage</title>
belaran@964: 
belaran@964: <para>The underpinnings of changelogs, manifests, and filelogs are provided
belaran@964: by a single structure called the <emphasis>revlog</emphasis>.
belaran@964: </para>
belaran@964: 
belaran@964: <sect2>
belaran@964: <title>Efficient storage</title>
belaran@964: 
belaran@964: <para>The revlog provides efficient storage of revisions using a
belaran@964: <emphasis>delta</emphasis> mechanism.  Instead of storing a complete copy of a file
belaran@964: for each revision, it stores the changes needed to transform an older
belaran@964: revision into the new revision.  For many kinds of file data, these
belaran@964: deltas are typically a fraction of a percent of the size of a full
belaran@964: copy of a file.
belaran@964: </para>
belaran@964: 
belaran@964: <para>Some obsolete revision control systems can only work with deltas of
belaran@964: text files.  They must either store binary files as complete snapshots
belaran@964: or encoded into a text representation, both of which are wasteful
belaran@964: approaches.  Mercurial can efficiently handle deltas of files with
belaran@964: arbitrary binary contents; it doesn't need to treat text as special.
belaran@964: </para>
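<para>As a rough illustration of a delta over arbitrary bytes, the Python
sketch below represents a delta as a list of (start, end, replacement)
edits.  It uses <literal>difflib</literal> from the standard library to
find the changed regions; that choice is an assumption made for brevity,
not Mercurial's own delta algorithm, and <literal>make_delta</literal> and
<literal>apply_delta</literal> are hypothetical names.</para>

<programlisting language="python"><![CDATA[
```python
from difflib import SequenceMatcher


def make_delta(old: bytes, new: bytes):
    """Return (start, end, replacement) edits that turn old into new."""
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    return [(i1, i2, new[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]


def apply_delta(old: bytes, delta):
    """Rebuild the new revision from the old one plus a delta."""
    out, pos = [], 0
    for start, end, replacement in delta:
        out.append(old[pos:start])  # copy the unchanged bytes
        out.append(replacement)     # splice in the changed bytes
        pos = end
    out.append(old[pos:])
    return b"".join(out)


# The data is treated as raw bytes; nothing here assumes it is text.
old = b"".join(b"record %d\n" % i for i in range(100))
new = old[:400] + b"\x00\xffbinary patch\x00" + old[420:]
delta = make_delta(old, new)
```
]]></programlisting>

<para>Only the replacement bytes and their offsets need to be stored; the
unchanged regions are copied from the older revision when the delta is
applied.</para>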
belaran@964: 
belaran@964: </sect2>
belaran@964: <sect2 id="sec:concepts:txn">
belaran@964: <title>Safe operation</title>
belaran@964: 
belaran@964: <para>Mercurial only ever <emphasis>appends</emphasis> data to the end of a revlog file.
belaran@964: It never modifies a section of a file after it has written it.  This
belaran@964: is both more robust and more efficient than schemes that need to modify
belaran@964: or rewrite data.
belaran@964: </para>
belaran@964: 
belaran@964: <para>In addition, Mercurial treats every write as part of a
belaran@964: <emphasis>transaction</emphasis> that can span a number of files.  A transaction is
belaran@964: <emphasis>atomic</emphasis>: either the entire transaction succeeds and its effects
belaran@964: are all visible to readers in one go, or the whole thing is undone.
belaran@964: This guarantee of atomicity means that if you're running two copies of
belaran@964: Mercurial, where one is reading data and one is writing it, the reader
belaran@964: will never see a partially written result that might confuse it.
belaran@964: </para>
belaran@964: 
belaran@964: <para>The fact that Mercurial only appends to files makes it easier to
belaran@964: provide this transactional guarantee.  The easier it is to do stuff
belaran@964: like this, the more confident you should be that it's done correctly.
belaran@964: </para>
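<para>The link between append-only files and cheap transactions can be
sketched in a few lines of Python.  This toy
<literal>Transaction</literal> class is hypothetical: it remembers each
file's length before the first write and undoes a failed transaction by
truncating the files back to those lengths.  It keeps its bookkeeping in
memory, which a real implementation could not rely on after a crash, so
treat it as an illustration of the idea only.</para>

<programlisting language="python"><![CDATA[
```python
import os
import tempfile


class Transaction:
    """Toy append-only transaction: record sizes, truncate on abort."""

    def __init__(self):
        self.original_sizes = {}

    def append(self, path, data: bytes):
        if path not in self.original_sizes:
            # Remember the length before our first write so we can undo it.
            self.original_sizes[path] = (
                os.path.getsize(path) if os.path.exists(path) else 0)
        with open(path, "ab") as f:
            f.write(data)

    def abort(self):
        # Undo by cutting every touched file back to its recorded length.
        for path, size in self.original_sizes.items():
            with open(path, "r+b") as f:
                f.truncate(size)
        self.original_sizes.clear()

    def commit(self):
        # Nothing to undo: the appended data simply stays in place.
        self.original_sizes.clear()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.i")
    tr = Transaction()
    tr.append(path, b"revision 0\n")
    tr.commit()

    tr = Transaction()
    tr.append(path, b"half-written revi")
    tr.abort()  # the partial write disappears

    with open(path, "rb") as f:
        survived = f.read()
```
]]></programlisting>

<para>Because earlier bytes are never rewritten, rolling back needs no copy
of the old data; knowing where each file used to end is enough.</para>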
belaran@964: 
belaran@964: </sect2>
belaran@964: <sect2>
belaran@964: <title>Fast retrieval</title>
belaran@964: 
belaran@964: <para>Mercurial cleverly avoids a pitfall common to most earlier
belaran@964: revision control systems: the problem of <emphasis>inefficient retrieval</emphasis>.
belaran@964: Most revision control systems store the contents of a revision as an
belaran@964: incremental series of modifications against a <quote>snapshot</quote>.  To
belaran@964: reconstruct a specific revision, you must first read the snapshot, and
belaran@964: then every one of the revisions between the snapshot and your target
belaran@964: revision.  The more history that a file accumulates, the more
belaran@964: revisions you must read, hence the longer it takes to reconstruct a
belaran@964: particular revision.
belaran@964: </para>
belaran@964: 
belaran@964: <informalfigure id="fig:concepts:snapshot">
belaran@964: 
belaran@964: <para>  <mediaobject><imageobject><imagedata fileref="snapshot"/></imageobject><textobject><phrase>XXX add text</phrase></textobject></mediaobject>
belaran@964:   <caption><para>Snapshot of a revlog, with incremental deltas</para></caption>
belaran@964: </para>
belaran@964: </informalfigure>
belaran@964: 
belaran@964: <para>The innovation that Mercurial applies to this problem is simple but
belaran@964: effective.  Once the cumulative amount of delta information stored
belaran@964: since the last snapshot exceeds a fixed threshold, Mercurial stores a new
belaran@964: snapshot (compressed, of course), instead of another delta.  This
belaran@964: makes it possible to reconstruct <emphasis>any</emphasis> revision of a file
belaran@964: quickly.  This approach works so well that it has since been copied by
belaran@964: several other revision control systems.
belaran@964: </para>
belaran@964: 
belaran@964: <para>Figure <xref linkend="fig:concepts:snapshot"/> illustrates the idea.  In an entry
belaran@964: in a revlog's index file, Mercurial stores the range of entries from
belaran@964: the data file that it must read to reconstruct a particular revision.
belaran@964: </para>
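<para>The snapshot rule can be sketched as follows.  The sketch tracks only
sizes, not contents, and the <literal>THRESHOLD</literal> value, the entry
fields, and the function names are all assumptions chosen for
illustration.</para>

<programlisting language="python"><![CDATA[
```python
THRESHOLD = 100  # assumed limit, in bytes of delta data since the last snapshot


def append_entry(index, full_size, delta_size, threshold=THRESHOLD):
    """Append an index entry, storing either a delta or a fresh snapshot."""
    pending = None
    if index:
        base = index[-1]["base"]
        # Delta bytes already stored since the snapshot, plus the new delta.
        pending = sum(e["stored"] for e in index[base + 1:]) + delta_size
    if pending is None or pending > threshold:
        index.append({"kind": "snapshot", "stored": full_size,
                      "base": len(index)})
    else:
        index.append({"kind": "delta", "stored": delta_size, "base": base})


def entries_to_read(index, rev):
    """Entries needed to rebuild rev: its snapshot through rev itself."""
    return range(index[rev]["base"], rev + 1)


index = []
for _ in range(10):
    append_entry(index, full_size=500, delta_size=30)
kinds = [entry["kind"] for entry in index]
```
]]></programlisting>

<para>With these numbers a new snapshot appears every fourth revision, so
no reconstruction ever reads more than four entries, however long the
history grows.</para>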
belaran@964: 
belaran@964: <sect3>
belaran@964: <title>Aside: the influence of video compression</title>
belaran@964: 
belaran@964: <para>If you're familiar with video compression or have ever watched a TV
belaran@964: feed through a digital cable or satellite service, you may know that
belaran@964: most video compression schemes store each frame of video as a delta
belaran@964: against its predecessor frame.  In addition, these schemes use
belaran@964: <quote>lossy</quote> compression techniques to increase the compression ratio, so
belaran@964: visual errors accumulate over the course of a number of inter-frame
belaran@964: deltas.
belaran@964: </para>
belaran@964: 
belaran@964: <para>Because it's possible for a video stream to <quote>drop out</quote> occasionally
belaran@964: due to signal glitches, and to limit the accumulation of artefacts
belaran@964: introduced by the lossy compression process, video encoders
belaran@964: periodically insert a complete frame (called a <quote>key frame</quote>) into the
belaran@964: video stream; the next delta is generated against that frame.  This
belaran@964: means that if the video signal gets interrupted, it will resume once
belaran@964: the next key frame is received.  Also, the accumulation of encoding
belaran@964: errors restarts anew with each key frame.
belaran@964: </para>
belaran@964: 
belaran@964: </sect3>
belaran@964: </sect2>
belaran@964: <sect2>
belaran@964: <title>Identification and strong integrity</title>
belaran@964: 
belaran@964: <para>Along with delta or snapshot information, a revlog entry contains a
belaran@964: cryptographic hash of the data that it represents.  This makes it
belaran@964: difficult to forge the contents of a revision, and easy to detect
belaran@964: accidental corruption.
belaran@964: </para>
belaran@964: 
belaran@964: <para>Hashes provide more than a mere check against corruption; they are
belaran@964: used as the identifiers for revisions.  The changeset identification
belaran@964: hashes that you see as an end user are from revisions of the
belaran@964: changelog.  Although filelogs and the manifest also use hashes,
belaran@964: Mercurial only uses these behind the scenes.
belaran@964: </para>
belaran@964: 
belaran@964: <para>Mercurial verifies that hashes are correct when it retrieves file
belaran@964: revisions and when it pulls changes from another repository.  If it
belaran@964: encounters an integrity problem, it will complain and stop whatever
belaran@964: it's doing.
belaran@964: </para>
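<para>The following Python sketch shows how a hash can serve as both an
identifier and an integrity check.  The chapter does not name the hash
function or say exactly what is hashed; the sketch assumes SHA-1 over the
two parent identifiers (in sorted order) followed by the revision data, and
uses twenty zero bytes as a placeholder for a missing parent.  Treat those
details, and the function names, as assumptions rather than a
specification.</para>

<programlisting language="python"><![CDATA[
```python
import hashlib

NULL_ID = b"\0" * 20  # assumed placeholder meaning "no parent here"


def node_id(data: bytes, p1: bytes = NULL_ID, p2: bytes = NULL_ID) -> bytes:
    """Hash the two parent IDs (sorted) followed by the revision data."""
    h = hashlib.sha1(min(p1, p2) + max(p1, p2))
    h.update(data)
    return h.digest()


def verify(data: bytes, p1: bytes, p2: bytes, expected: bytes) -> bool:
    """Recompute the hash and compare it with the recorded identifier."""
    return node_id(data, p1, p2) == expected


root = node_id(b"hello\n")
child = node_id(b"hello, world\n", root)
```
]]></programlisting>

<para>Because the parents feed into the hash, the same file contents
committed at two different points in history receive different
identifiers, and any change to stored data makes
<literal>verify</literal> fail.</para>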
belaran@964: 
belaran@964: <para>In addition to the effect it has on retrieval efficiency, Mercurial's
belaran@964: use of periodic snapshots makes it more robust against partial data
belaran@964: corruption.  If a revlog becomes partly corrupted due to a hardware
belaran@964: error or system bug, it's often possible to reconstruct some or most
belaran@964: revisions from the uncorrupted sections of the revlog, both before and
belaran@964: after the corrupted section.  This would not be possible with a
belaran@964: delta-only storage model.
belaran@964: </para>
belaran@964: 
belaran@964: </sect2>
belaran@964: </sect1>
belaran@964: <sect1>
belaran@964: <title>Revision history, branching, and merging</title>
belaran@964: 
belaran@964: <para>Every entry in a Mercurial revlog knows the identity of its immediate
belaran@964: ancestor revision, usually referred to as its <emphasis>parent</emphasis>.  In fact,
belaran@964: a revision contains room for not one parent, but two.  Mercurial uses
belaran@964: a special hash, called the <quote>null ID</quote>, to represent the idea <quote>there
belaran@964: is no parent here</quote>.  This hash is simply a string of zeroes.
belaran@964: </para>
belaran@964: 
belaran@964: <para>In figure <xref linkend="fig:concepts:revlog"/>, you can see an example of the
belaran@964: conceptual structure of a revlog.  Filelogs, manifests, and changelogs
belaran@964: all have this same structure; they differ only in the kind of data
belaran@964: stored in each delta or snapshot.
belaran@964: </para>
belaran@964: 
belaran@964: <para>The first revision in a revlog (at the bottom of the image) has the
belaran@964: null ID in both of its parent slots.  For a <quote>normal</quote> revision, its
belaran@964: first parent slot contains the ID of its parent revision, and its
belaran@964: second contains the null ID, indicating that the revision has only one
belaran@964: real parent.  Any two revisions that have the same parent ID are
belaran@964: branches.  A revision that represents a merge between branches has two
belaran@964: normal revision IDs in its parent slots.
belaran@964: </para>
belaran@964: 
belaran@964: <informalfigure id="fig:concepts:revlog">
belaran@964: 
belaran@964: <para>  <mediaobject><imageobject><imagedata fileref="revlog"/></imageobject><textobject><phrase>XXX add text</phrase></textobject></mediaobject>
belaran@964:   <caption><para>The conceptual structure of a revlog</para></caption>
belaran@964: </para>
belaran@964: </informalfigure>
belaran@964: 
belaran@964: </sect1>
belaran@964: <sect1>
belaran@964: <title>The working directory</title>
belaran@964: 
belaran@964: <para>In the working directory, Mercurial stores a snapshot of the files
belaran@964: from the repository as of a particular changeset.
belaran@964: </para>
belaran@964: 
belaran@964: <para>The working directory <quote>knows</quote> which changeset it contains.  When you
belaran@964: update the working directory to contain a particular changeset,
belaran@964: Mercurial looks up the appropriate revision of the manifest to find
belaran@964: out which files it was tracking at the time that changeset was
belaran@964: committed, and which revision of each file was then current.  It then
belaran@964: recreates a copy of each of those files, with the same contents it had
belaran@964: when the changeset was committed.
belaran@964: </para>
belaran@964: 
belaran@964: <para>The <emphasis>dirstate</emphasis> contains Mercurial's knowledge of the working
belaran@964: directory.  It records which changeset the working directory is
belaran@964: updated to, and all of the files that Mercurial is tracking in the
belaran@964: working directory.
belaran@964: </para>
belaran@964: 
belaran@964: <para>Just as a revision of a revlog has room for two parents, so that it
belaran@964: can represent either a normal revision (with one parent) or a merge of
belaran@964: two earlier revisions, the dirstate has slots for two parents.  When
belaran@964: you use the <command role="hg-cmd">hg update</command> command, the changeset that you update to
belaran@964: is stored in the <quote>first parent</quote> slot, and the null ID in the second.
belaran@964: When you <command role="hg-cmd">hg merge</command> with another changeset, the first parent
belaran@964: remains unchanged, and the second parent is filled in with the
belaran@964: changeset you're merging with.  The <command role="hg-cmd">hg parents</command> command tells you
belaran@964: what the parents of the dirstate are.
belaran@964: </para>
belaran@964: 
belaran@964: <sect2>
belaran@964: <title>What happens when you commit</title>
belaran@964: 
belaran@964: <para>The dirstate stores parent information for more than just book-keeping
belaran@964: purposes.  Mercurial uses the parents of the dirstate as <emphasis>the
belaran@964: parents of a new changeset</emphasis> when you perform a commit.
belaran@964: </para>
belaran@964: 
belaran@964: <informalfigure id="fig:concepts:wdir">
belaran@964: 
belaran@964: <para>  <mediaobject><imageobject><imagedata fileref="wdir"/></imageobject><textobject><phrase>XXX add text</phrase></textobject></mediaobject>
belaran@964:   <caption><para>The working directory can have two parents</para></caption>
belaran@964: </para>
belaran@964: </informalfigure>
belaran@964: 
belaran@964: <para>Figure <xref linkend="fig:concepts:wdir"/> shows the normal state of the working
belaran@964: directory, where it has a single changeset as parent.  That changeset
belaran@964: is the <emphasis>tip</emphasis>, the newest changeset in the repository that has no
belaran@964: children.
belaran@964: </para>
belaran@964: 
belaran@964: <informalfigure id="fig:concepts:wdir-after-commit">
belaran@964: 
belaran@964: <para>  <mediaobject><imageobject><imagedata fileref="wdir-after-commit"/></imageobject><textobject><phrase>XXX add text</phrase></textobject></mediaobject>
belaran@964:   <caption><para>The working directory gains new parents after a commit</para></caption>
belaran@964: </para>
belaran@964: </informalfigure>
belaran@964: 
belaran@964: <para>It's useful to think of the working directory as <quote>the changeset I'm
belaran@964: about to commit</quote>.  Any files that you tell Mercurial that you've
belaran@964: added, removed, renamed, or copied will be reflected in that
belaran@964: changeset, as will modifications to any files that Mercurial is
belaran@964: already tracking; the new changeset will have the parents of the
belaran@964: working directory as its parents.
belaran@964: </para>
belaran@964: 
belaran@964: <para>After a commit, Mercurial will update the parents of the working
belaran@964: directory, so that the first parent is the ID of the new changeset,
belaran@964: and the second is the null ID.  This is shown in
belaran@964: figure <xref linkend="fig:concepts:wdir-after-commit"/>.  Mercurial doesn't touch
belaran@964: any of the files in the working directory when you commit; it just
belaran@964: modifies the dirstate to note its new parents.
belaran@964: </para>
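<para>The way the two parent slots change during update, merge, and commit
can be modelled with a small Python class.  This
<literal>Dirstate</literal> is a hypothetical stand-in that tracks only the
parents; the changeset names and the <literal>history</literal> dictionary
are made up for the example.</para>

<programlisting language="python"><![CDATA[
```python
NULL_ID = "0" * 40  # a string of zeroes standing for "no parent here"


class Dirstate:
    """Toy model of the dirstate's two parent slots."""

    def __init__(self):
        self.parents = (NULL_ID, NULL_ID)

    def update(self, changeset):
        # Updating fills the first slot and clears the second.
        self.parents = (changeset, NULL_ID)

    def merge(self, other):
        # Merging keeps the first parent and fills in the second.
        self.parents = (self.parents[0], other)

    def commit(self, new_id, history):
        # The new changeset inherits the dirstate's parents ...
        history[new_id] = self.parents
        # ... and then becomes the sole parent of the working directory.
        self.parents = (new_id, NULL_ID)


history = {}
ds = Dirstate()
ds.update("a")
ds.commit("b", history)  # ordinary commit on top of "a"
ds.merge("c")
ds.commit("d", history)  # merge commit with parents "b" and "c"
```
]]></programlisting>

<para>Note that <literal>commit</literal> touches only the recorded
parents, mirroring the fact that a commit leaves the files in the working
directory alone.</para>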
belaran@964: 
belaran@964: </sect2>
belaran@964: <sect2>
belaran@964: <title>Creating a new head</title>
belaran@964: 
belaran@964: <para>It's perfectly normal to update the working directory to a changeset
belaran@964: other than the current tip.  For example, you might want to know what
belaran@964: your project looked like last Tuesday, or you could be looking through
belaran@964: changesets to see which one introduced a bug.  In cases like this, the
belaran@964: natural thing to do is update the working directory to the changeset
belaran@964: you're interested in, and then examine the files in the working
belaran@964: directory directly to see their contents as they were when you
belaran@964: committed that changeset.  The effect of this is shown in
belaran@964: figure <xref linkend="fig:concepts:wdir-pre-branch"/>.
belaran@964: </para>
belaran@964: 
belaran@964: <informalfigure id="fig:concepts:wdir-pre-branch">
belaran@964: 
belaran@964: <para>  <mediaobject><imageobject><imagedata fileref="wdir-pre-branch"/></imageobject><textobject><phrase>XXX add text</phrase></textobject></mediaobject>
belaran@964:   <caption><para>The working directory, updated to an older changeset</para></caption>
belaran@964: </para>
belaran@964: </informalfigure>
belaran@964: 
belaran@964: <para>Having updated the working directory to an older changeset, what
belaran@964: happens if you make some changes, and then commit?  Mercurial behaves
belaran@964: in the same way as I outlined above.  The parents of the working
belaran@964: directory become the parents of the new changeset.  This new changeset
belaran@964: has no children, so it becomes the new tip.  And the repository now
belaran@964: contains two changesets that have no children; we call these
belaran@964: <emphasis>heads</emphasis>.  You can see the structure that this creates in
belaran@964: figure <xref linkend="fig:concepts:wdir-branch"/>.
belaran@964: </para>
belaran@964: 
belaran@964: <informalfigure id="fig:concepts:wdir-branch">
belaran@964: 
belaran@964: <para>  <mediaobject><imageobject><imagedata fileref="wdir-branch"/></imageobject><textobject><phrase>XXX add text</phrase></textobject></mediaobject>
belaran@964:   <caption><para>After a commit made while synced to an older changeset</para></caption>
belaran@964: </para>
belaran@964: </informalfigure>
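<para>Finding the heads of a repository amounts to finding the changesets
that nothing else names as a parent.  Here is a minimal Python sketch,
assuming a hypothetical <literal>parents</literal> mapping from changeset
number to its two parent slots, with <literal>None</literal> standing in
for the null ID.</para>

<programlisting language="python"><![CDATA[
```python
def heads(parents):
    """Return the changesets that no other changeset names as a parent."""
    has_child = {p for slots in parents.values() for p in slots
                 if p is not None}
    return sorted(c for c in parents if c not in has_child)


# A straight line of history: 0 <- 1 <- 2.
parents = {0: (None, None), 1: (0, None), 2: (1, None)}
before = heads(parents)

# Update back to changeset 1, change something, and commit.
parents[3] = (1, None)
after = heads(parents)
```
]]></programlisting>

<para>Before the extra commit the only head is changeset 2; afterwards
changesets 2 and 3 are both heads, since neither has a child.</para>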
belaran@964: 
belaran@964: <note>
belaran@964: <para>  If you're new to Mercurial, you should keep in mind a common
belaran@964:   <quote>error</quote>, which is to use the <command role="hg-cmd">hg pull</command> command without any
belaran@964:   options.  By default, the <command role="hg-cmd">hg pull</command> command <emphasis>does not</emphasis>
belaran@964:   update the working directory, so you'll bring new changesets into
belaran@964:   your repository, but the working directory will stay synced at the
belaran@964:   same changeset as before the pull.  If you make some changes and
belaran@964:   commit afterwards, you'll thus create a new head, because your
belaran@964:   working directory isn't synced to whatever the current tip is.
belaran@964: </para>
belaran@964: 
belaran@964: <para>  I put the word <quote>error</quote> in quotes because all that you need to do
belaran@964:   to rectify this situation is <command role="hg-cmd">hg merge</command>, then <command role="hg-cmd">hg commit</command>.  In
belaran@964:   other words, this almost never has negative consequences; it just
belaran@964:   surprises people.  I'll discuss other ways to avoid this behaviour,
belaran@964:   and why Mercurial behaves in this initially surprising way, later
belaran@964:   on.
belaran@964: </para>
belaran@964: </note>
belaran@964: 
belaran@964: </sect2>
belaran@964: <sect2>
belaran@964: <title>Merging heads</title>
belaran@964: 
belaran@964: <para>When you run the <command role="hg-cmd">hg merge</command> command, Mercurial leaves the first
belaran@964: parent of the working directory unchanged, and sets the second parent
belaran@964: to the changeset you're merging with, as shown in
belaran@964: figure <xref linkend="fig:concepts:wdir-merge"/>.
belaran@964: </para>
belaran@964: 
belaran@964: <informalfigure id="fig:concepts:wdir-merge">
belaran@964: 
belaran@964: <para>  <mediaobject><imageobject><imagedata fileref="wdir-merge"/></imageobject><textobject><phrase>XXX add text</phrase></textobject></mediaobject>
belaran@964:   <caption><para>Merging two heads</para></caption>
belaran@964: </para>
belaran@964: </informalfigure>
belaran@964: 
belaran@964: <para>Mercurial also has to modify the working directory, to merge the files
belaran@964: managed in the two changesets.  Simplified a little, the merging
belaran@964: process goes like this, for every file in the manifests of both
belaran@964: changesets:
belaran@964: </para>
belaran@964: <itemizedlist>
belaran@964: <listitem><para>If neither changeset has modified a file, do nothing with that
belaran@964:   file.
belaran@964: </para>
belaran@964: </listitem>
belaran@964: <listitem><para>If one changeset has modified a file, and the other hasn't,
belaran@964:   create the modified copy of the file in the working directory.
belaran@964: </para>
belaran@964: </listitem>
belaran@964: <listitem><para>If one changeset has removed a file, and the other has left it
belaran@964:   unmodified (or has also deleted it), delete the file from the working
belaran@964:   directory.
belaran@964: </para>
belaran@964: </listitem>
belaran@964: <listitem><para>If one changeset has removed a file, but the other has modified
belaran@964:   the file, ask the user what to do: keep the modified file, or remove
belaran@964:   it?
belaran@964: </para>
belaran@964: </listitem>
belaran@964: <listitem><para>If both changesets have modified a file, invoke an external
belaran@964:   merge program to choose the new contents for the merged file.  This
belaran@964:   may require input from the user.
belaran@964: </para>
belaran@964: </listitem>
belaran@964: <listitem><para>If one changeset has modified a file, and the other has renamed
belaran@964:   or copied the file, make sure that the changes follow the new name
belaran@964:   of the file.
belaran@964: </para>
belaran@964: </listitem></itemizedlist>
belaran@964: <para>There are more details (merging has plenty of corner cases), but
belaran@964: these are the most common choices that are involved in a merge.  As
belaran@964: you can see, most cases are completely automatic, and indeed most
belaran@964: merges finish automatically, without requiring your input to resolve
belaran@964: any conflicts.
belaran@964: </para>
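<para>The per-file rules listed above can be sketched as a single Python
function.  The sketch assumes that <quote>modified</quote> means
<quote>different from the version in a common ancestor</quote>, which it
calls <literal>base</literal>; it covers only files that exist in that
ancestor, and it leaves out the rename and copy rule.  The function name
and the returned strings are hypothetical.</para>

<programlisting language="python"><![CDATA[
```python
def merge_action(base, local, other):
    """Pick an action for one file that existed in the common ancestor.

    Each argument is the file's contents on that side, or None if the
    file has been removed there.
    """
    if local == base and other == base:
        return "do nothing"
    if local is None or other is None:
        # At least one side removed the file; look at what the other did.
        survivor = other if local is None else local
        if survivor is None or survivor == base:
            return "delete from working directory"
        return "ask user: keep the modified file, or remove it"
    if local != base and other != base:
        return "run external merge program"
    return "use the modified copy"
```
]]></programlisting>

<para>Only the last two removal and modification combinations can involve
the user; every other case is decided without any input.</para>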
belaran@964: 
belaran@964: <para>When you're thinking about what happens when you commit after a merge,
belaran@964: once again the working directory is <quote>the changeset I'm about to
belaran@964: commit</quote>.  After the <command role="hg-cmd">hg merge</command> command completes, the working
belaran@964: directory has two parents; these will become the parents of the new
belaran@964: changeset.
belaran@964: </para>
belaran@964: 
belaran@964: <para>Mercurial lets you perform multiple merges, but you must commit the
belaran@964: results of each individual merge as you go.  This is necessary because
belaran@964: Mercurial only tracks two parents for both revisions and the working
belaran@964: directory.  While it would be technically possible to merge multiple
belaran@964: changesets at once, the prospect of user confusion and making a
belaran@964: terrible mess of a merge immediately becomes overwhelming.
belaran@964: </para>
belaran@964: 
belaran@964: </sect2>
belaran@964: </sect1>
belaran@964: <sect1>
belaran@964: <title>Other interesting design features</title>
belaran@964: 
belaran@964: <para>In the sections above, I've tried to highlight some of the most
belaran@964: important aspects of Mercurial's design, to illustrate that it pays
belaran@964: careful attention to reliability and performance.  However, the
belaran@964: attention to detail doesn't stop there.  There are a number of other
belaran@964: aspects of Mercurial's construction that I personally find
belaran@964: interesting.  I'll detail a few of them here, separate from the <quote>big
belaran@964: ticket</quote> items above, so that if you're interested, you can gain a
belaran@964: better idea of the amount of thinking that goes into a well-designed
belaran@964: system.
belaran@964: </para>
belaran@964: 
belaran@964: <sect2>
belaran@964: <title>Clever compression</title>
belaran@964: 
belaran@964: <para>When appropriate, Mercurial will store both snapshots and deltas in
belaran@964: compressed form.  It does this by always <emphasis>trying to</emphasis> compress a
belaran@964: snapshot or delta, but only storing the compressed version if it's
belaran@964: smaller than the uncompressed version.
belaran@964: </para>
belaran@964: 
belaran@964: <para>This means that Mercurial does <quote>the right thing</quote> when storing a file
belaran@964: whose native form is compressed, such as a <literal>zip</literal> archive or a
belaran@964: JPEG image.  When these types of files are compressed a second time,
belaran@964: the resulting file is usually bigger than the once-compressed form,
belaran@964: and so Mercurial will store the plain <literal>zip</literal> or JPEG.
belaran@964: </para>
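<para>The <quote>try to compress, keep whichever is smaller</quote> rule is
easy to sketch with Python's <literal>zlib</literal> module.  The one-byte
markers and the function names below are hypothetical; they exist only so
that the reader of a chunk can tell which form was stored.</para>

<programlisting language="python"><![CDATA[
```python
import hashlib
import zlib


def encode_chunk(data: bytes) -> bytes:
    """Store data compressed only when that is actually smaller."""
    compressed = zlib.compress(data)
    if len(compressed) < len(data):
        return b"z" + compressed  # hypothetical marker: compressed payload
    return b"u" + data            # hypothetical marker: stored as-is


def decode_chunk(chunk: bytes) -> bytes:
    """Undo encode_chunk, whichever form was stored."""
    marker, payload = chunk[:1], chunk[1:]
    return zlib.decompress(payload) if marker == b"z" else payload


text = b"the quick brown fox\n" * 200
# Deterministic stand-in for already-compressed data: hash output has no
# redundancy left for a second compression pass to remove.
noise = b"".join(hashlib.sha256(bytes([i])).digest() for i in range(64))
```
]]></programlisting>

<para>Repetitive text is stored compressed, while data with no redundancy
left, like the contents of an archive or a JPEG image, is stored
unchanged.</para>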
belaran@964: 
belaran@964: <para>Deltas between revisions of a compressed file are usually larger than
belaran@964: snapshots of the file, and Mercurial again does <quote>the right thing</quote> in
belaran@964: these cases.  It finds that such a delta exceeds the threshold at
belaran@964: which it should store a complete snapshot of the file, so it stores
belaran@964: the snapshot, again saving space compared to a naive delta-only
belaran@964: approach.
belaran@964: </para>
belaran@964: 
belaran@964: <sect3>
belaran@964: <title>Network recompression</title>
belaran@964: 
belaran@964: <para>When storing revisions on disk, Mercurial uses the <quote>deflate</quote>
belaran@964: compression algorithm (the same one used by the popular <literal>zip</literal>
belaran@964: archive format), which balances good speed with a respectable
belaran@964: compression ratio.  However, when transmitting revision data over a
belaran@964: network connection, Mercurial uncompresses the compressed revision
belaran@964: data.
belaran@964: </para>
belaran@964: 
belaran@964: <para>If the connection is over HTTP, Mercurial recompresses the entire
belaran@964: stream of data using a compression algorithm that gives a better
belaran@964: compression ratio (the Burrows-Wheeler algorithm from the widely used
belaran@964: <literal>bzip2</literal> compression package).  This combination of algorithm
belaran@964: and compression of the entire stream (instead of a revision at a time)
belaran@964: substantially reduces the number of bytes to be transferred, yielding
belaran@964: better network performance over almost all kinds of network.
belaran@964: </para>
belaran@964: 
belaran@964: <para>(If the connection is over <command>ssh</command>, Mercurial <emphasis>doesn't</emphasis>
belaran@964: recompress the stream, because <command>ssh</command> can already do this
belaran@964: itself.)
belaran@964: </para>
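<para>The benefit of compressing a whole stream, and not one revision at a
time, can be seen in a small experiment.  This Python sketch uses the
standard <literal>zlib</literal> and <literal>bz2</literal> modules as
stand-ins for the two algorithms, and the 200 small chunks are made-up data
playing the part of revision deltas; actual savings depend entirely on the
data being sent.</para>

<programlisting language="python"><![CDATA[
```python
import bz2
import zlib

# 200 small, similar chunks stand in for the deltas of a long history.
chunks = [b"changed line %d in file example.txt\n" % n for n in range(200)]

# Stored form: every chunk is compressed on its own (deflate).
per_chunk_total = sum(len(zlib.compress(c)) for c in chunks)

# Wire form: the uncompressed chunks are joined, then compressed once (bzip2).
stream = b"".join(chunks)
whole_stream = len(bz2.compress(stream))
```
]]></programlisting>

<para>Each separately compressed chunk pays its own overhead and cannot
exploit similarities with its neighbours, so here
<literal>whole_stream</literal> comes out smaller than
<literal>per_chunk_total</literal>.</para>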
belaran@964: 
belaran@964: </sect3>
belaran@964: </sect2>
belaran@964: <sect2>
belaran@964: <title>Read/write ordering and atomicity</title>
belaran@964: 
belaran@964: <para>Appending to files isn't the whole story when it comes to guaranteeing
belaran@964: that a reader won't see a partial write.  If you recall
belaran@964: figure <xref linkend="fig:concepts:metadata"/>, revisions in the changelog point to
belaran@964: revisions in the manifest, and revisions in the manifest point to
belaran@964: revisions in filelogs.  This hierarchy is deliberate.
belaran@964: </para>
belaran@964: 
belaran@964: <para>A writer starts a transaction by writing filelog and manifest data,
belaran@964: and doesn't write any changelog data until those are finished.  A
belaran@964: reader starts by reading changelog data, then manifest data, followed
belaran@964: by filelog data.
belaran@964: </para>
belaran@964: 
belaran@964: <para>Since the writer has always finished writing filelog and manifest data
belaran@964: before it writes to the changelog, a reader will never read a pointer
belaran@964: to a partially written manifest revision from the changelog, and it will
belaran@964: never read a pointer to a partially written filelog revision from the
belaran@964: manifest.
belaran@964: </para>
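
<para>The ordering argument can be sketched with a toy in-memory model.
The <literal>Repo</literal> class and its method names are hypothetical,
and plain Python lists stand in for the append-only files; the sketch
models only the order of the writes, not partial writes within a single
append.</para>

```python
class Repo:
    """A toy in-memory model; real revlogs are append-only files on disk."""

    def __init__(self):
        self.filelog = []    # file contents
        self.manifest = []   # each entry: index into filelog
        self.changelog = []  # each entry: index into manifest

    def commit_steps(self, data):
        """Write one changeset, yielding after every individual append."""
        self.filelog.append(data)
        yield "filelog written"
        self.manifest.append(len(self.filelog) - 1)
        yield "manifest written"
        self.changelog.append(len(self.manifest) - 1)
        yield "changelog written"

    def read_tip(self):
        """Read as a reader does: changelog, then manifest, then filelog."""
        if not self.changelog:
            return None
        manifest_rev = self.changelog[-1]
        filelog_rev = self.manifest[manifest_rev]
        return self.filelog[filelog_rev]


repo = Repo()
observed = []
for data in ["first", "second"]:
    for step in repo.commit_steps(data):
        # A reader may run between any two writes; it never follows a
        # pointer to data that has not been written yet.
        observed.append((step, repo.read_tip()))

for step, tip in observed:
    print(f"{step:18} reader sees {tip!r}")
```

<para>Until the changelog entry for <literal>"second"</literal> is
written, the reader keeps seeing <literal>"first"</literal>; the half-made
commit is simply invisible.</para>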
belaran@964: 
belaran@964: </sect2>
belaran@964: <sect2>
belaran@964: <title>Concurrent access</title>
belaran@964: 
belaran@964: <para>The read/write ordering and atomicity guarantees mean that Mercurial
belaran@964: never needs to <emphasis>lock</emphasis> a repository when it's reading data, even
belaran@964: if the repository is being written to while the read is occurring.
belaran@964: This has a big effect on scalability; you can have an arbitrary number
belaran@964: of Mercurial processes safely reading data from a repository
belaran@964: all at once, no matter whether it's being written to or not.
belaran@964: </para>
belaran@964: 
belaran@964: <para>The lockless nature of reading means that if you're sharing a
belaran@964: repository on a multi-user system, you don't need to grant other local
belaran@964: users permission to <emphasis>write</emphasis> to your repository in order for them
belaran@964: to be able to clone it or pull changes from it; they only need
belaran@964: <emphasis>read</emphasis> permission.  (This is <emphasis>not</emphasis> a common feature among
belaran@964: revision control systems, so don't take it for granted!  Most require
belaran@964: readers to be able to lock a repository to access it safely, and this
belaran@964: requires write permission on at least one directory, which of course
belaran@964: makes for all kinds of nasty and annoying security and administrative
belaran@964: problems.)
belaran@964: </para>
belaran@964: 
belaran@964: <para>Mercurial uses locks to ensure that only one process can write to a
belaran@964: repository at a time (the locking mechanism is safe even over
belaran@964: filesystems that are notoriously hostile to locking, such as NFS).  If
belaran@964: a repository is locked, a writer will wait and retry periodically in case
belaran@964: the repository becomes unlocked, but if the repository remains locked for
belaran@964: too long, the process attempting to write will give up and time out.
belaran@964: This means that your daily automated scripts won't get stuck forever
belaran@964: and pile up if a system crashes unnoticed, for example.  (Yes, the
belaran@964: timeout is configurable, from zero to infinity.)
belaran@964: </para>
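
<para>The wait, retry and time out behaviour might look roughly like the
following sketch.  This is not Mercurial's actual locking code, and it
makes no claim to be safe over NFS: the names <literal>acquire</literal>,
<literal>release</literal> and <literal>LockTimeout</literal>, the lock
file name, and the ten-second default are all assumptions made for
illustration.</para>

```python
import os
import tempfile
import time


class LockTimeout(Exception):
    """Raised when the lock stays held for longer than the timeout."""


def acquire(lock_path, timeout=10.0, poll=0.1):
    """Create lock_path exclusively, retrying until timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # O_CREAT | O_EXCL fails if the file already exists, so only
            # one process can succeed in creating it.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            if time.monotonic() >= deadline:
                raise LockTimeout(lock_path)
            time.sleep(poll)
        else:
            os.write(fd, str(os.getpid()).encode())
            os.close(fd)
            return


def release(lock_path):
    """Drop the lock by removing the lock file."""
    os.unlink(lock_path)


repo_dir = tempfile.mkdtemp()
lock_path = os.path.join(repo_dir, "lock")

acquire(lock_path)                    # the first writer gets the lock
try:
    acquire(lock_path, timeout=0.3)   # a second writer gives up after 0.3 s
    second_writer = "acquired"
except LockTimeout:
    second_writer = "timed out"
release(lock_path)
acquire(lock_path, timeout=0.3)       # succeeds once the lock is free
release(lock_path)
print("second writer:", second_writer)
```

<para>Note that only writers ever call <literal>acquire</literal>;
readers never touch the lock file at all.</para>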
belaran@964: 
belaran@964: <sect3>
belaran@964: <title>Safe dirstate access</title>
belaran@964: 
belaran@964: <para>As with revision data, Mercurial doesn't take a lock to read the
belaran@964: dirstate file; it does acquire a lock to write it.  To avoid the
belaran@964: possibility of reading a partially written copy of the dirstate file,
belaran@964: Mercurial writes to a file with a unique name in the same directory as
belaran@964: the dirstate file, then renames the temporary file atomically to
belaran@964: <filename>dirstate</filename>.  The file named <filename>dirstate</filename> is thus
belaran@964: guaranteed to be complete, not partially written.
belaran@964: </para>
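
<para>The write-then-rename technique is a common one, and a minimal
sketch of it looks like this.  The helper name
<literal>write_atomically</literal> and the temporary file prefix are
hypothetical; the sketch relies on <literal>os.replace</literal>, which
renames atomically on POSIX systems.</para>

```python
import os
import tempfile


def write_atomically(path, data):
    """Replace the file at path so readers never see a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives in the same directory, so the final rename
    # stays within one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix="dirstate-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_path, path)  # atomic rename over the old file
    except BaseException:
        # Clean up the temporary file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


workdir = tempfile.mkdtemp()
dirstate = os.path.join(workdir, "dirstate")
write_atomically(dirstate, b"old state")
write_atomically(dirstate, b"new state")
with open(dirstate, "rb") as f:
    contents = f.read()
print(contents, os.listdir(workdir))
```

<para>A reader that opens <filename>dirstate</filename> at any moment
gets either the complete old file or the complete new one, never a
mixture.</para>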
belaran@964: 
belaran@964: </sect3>
belaran@964: </sect2>
belaran@964: <sect2>
belaran@964: <title>Avoiding seeks</title>
belaran@964: 
belaran@964: <para>Critical to Mercurial's performance is the avoidance of seeks of the
belaran@964: disk head, since any seek is far more expensive than even a
belaran@964: comparatively large read operation.
belaran@964: </para>
belaran@964: 
belaran@964: <para>This is why, for example, the dirstate is stored in a single file.  If
belaran@964: there were a dirstate file per directory that Mercurial tracked, the
belaran@964: disk would seek once per directory.  Instead, Mercurial reads the
belaran@964: entire single dirstate file in one step.
belaran@964: </para>
belaran@964: 
belaran@964: <para>Mercurial also uses a <quote>copy on write</quote> scheme when cloning a
belaran@964: repository on local storage.  Instead of copying every revlog file
belaran@964: from the old repository into the new repository, it makes a <quote>hard
belaran@964: link</quote>, which is a shorthand way to say <quote>these two names point to the
belaran@964: same file</quote>.  When Mercurial is about to write to one of a revlog's
belaran@964: files, it checks to see if the number of names pointing at the file is
belaran@964: greater than one.  If it is, more than one repository is using the
belaran@964: file, so Mercurial makes a new copy of the file that is private to
belaran@964: this repository.
belaran@964: </para>
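
<para>Here is a small sketch of the hard link check, assuming a
filesystem that supports hard links.  The function names and file names
are hypothetical, and the real implementation may well differ in its
details; the point is only the link-count test before writing.</para>

```python
import os
import shutil
import tempfile


def clone_file(src, dst):
    """Clone by hard-linking: both names now refer to the same file."""
    os.link(src, dst)


def append_private(path, data):
    """Append to path, first breaking the hard link if the file is shared."""
    if os.stat(path).st_nlink > 1:
        # More than one name points at this file, so copy it before writing.
        tmp_path = path + ".tmp"
        shutil.copyfile(path, tmp_path)
        os.replace(tmp_path, path)
    with open(path, "ab") as f:
        f.write(data)


base = tempfile.mkdtemp()
original = os.path.join(base, "original.i")
clone = os.path.join(base, "clone.i")
with open(original, "wb") as f:
    f.write(b"rev0\n")

clone_file(original, clone)
links_after_clone = os.stat(original).st_nlink   # shared by two names
append_private(clone, b"rev1\n")                 # triggers the private copy
links_after_write = os.stat(original).st_nlink   # original is private again

with open(original, "rb") as f:
    original_data = f.read()
with open(clone, "rb") as f:
    clone_data = f.read()
print(links_after_clone, links_after_write, original_data, clone_data)
```

<para>The copy happens only on the first write to a shared file; files
that are never modified in the clone stay shared and cost no extra
space.</para>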
belaran@964: 
belaran@964: <para>A few revision control developers have pointed out that this idea of
belaran@964: making a complete private copy of a file is not very efficient in its
belaran@964: use of storage.  While this is true, storage is cheap, and this method
belaran@964: gives the highest performance while deferring most book-keeping to the
belaran@964: operating system.  An alternative scheme would most likely reduce
belaran@964: performance and increase the complexity of the software, each of which
belaran@964: is much more important to the <quote>feel</quote> of day-to-day use.
belaran@964: </para>
belaran@964: 
belaran@964: </sect2>
belaran@964: <sect2>
belaran@964: <title>Other contents of the dirstate</title>
belaran@964: 
belaran@964: <para>Because Mercurial doesn't force you to tell it when you're modifying a
belaran@964: file, it uses the dirstate to store some extra information so it can
belaran@964: determine efficiently whether you have modified a file.  For each file
belaran@964: in the working directory, it stores the time at which Mercurial itself
belaran@964: last modified the file, and the size of the file at that time.
belaran@964: </para>
belaran@964: 
belaran@964: <para>When you explicitly <command role="hg-cmd">hg add</command>, <command role="hg-cmd">hg remove</command>, <command role="hg-cmd">hg rename</command> or
belaran@964: <command role="hg-cmd">hg copy</command> files, Mercurial updates the dirstate so that it knows
belaran@964: what to do with those files when you commit.
belaran@964: </para>
belaran@964: 
belaran@964: <para>When Mercurial is checking the states of files in the working
belaran@964: directory, it first checks a file's modification time.  If that has
belaran@964: not changed, Mercurial assumes the file has not been modified.  If the
belaran@964: file's size has changed, the file must have been modified.  If the modification
belaran@964: time has changed, but the size has not, only then does Mercurial need
belaran@964: to read the actual contents of the file to see if they've changed.
belaran@964: Storing these few extra pieces of information dramatically reduces the
belaran@964: amount of data that Mercurial needs to read, which yields large
belaran@964: performance improvements compared to other revision control systems.
belaran@964: </para>
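
<para>The decision procedure described above can be sketched as follows.
The functions <literal>record</literal> and
<literal>is_modified</literal> are hypothetical, and the
<literal>recorded_contents</literal> argument stands in for the committed
revision that Mercurial would compare against.</para>

```python
import os
import tempfile


def record(path):
    """Return the (mtime, size) pair a dirstate-like table would remember."""
    st = os.stat(path)
    return st.st_mtime_ns, st.st_size


def is_modified(path, recorded, recorded_contents):
    """Decide whether path changed, reading it only as a last resort."""
    mtime, size = record(path)
    recorded_mtime, recorded_size = recorded
    if mtime == recorded_mtime:
        return False              # same time: assume unchanged
    if size != recorded_size:
        return True               # different size: certainly changed
    with open(path, "rb") as f:   # time changed, size did not: compare
        return f.read() != recorded_contents


def set_mtime(path, mtime_ns):
    """Force a known modification time so the demo is deterministic."""
    os.utime(path, ns=(mtime_ns, mtime_ns))


workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "hello.txt")
with open(path, "wb") as f:
    f.write(b"hello")
recorded = record(path)
later = recorded[0] + 10 * 10**9  # ten seconds after the recorded time

unchanged = is_modified(path, recorded, b"hello")

set_mtime(path, later)            # same contents, new timestamp
touched = is_modified(path, recorded, b"hello")

with open(path, "wb") as f:       # same size, different contents
    f.write(b"jello")
set_mtime(path, later)
edited = is_modified(path, recorded, b"hello")

with open(path, "wb") as f:       # different size
    f.write(b"hello, world")
set_mtime(path, later)
grown = is_modified(path, recorded, b"hello")

print(unchanged, touched, edited, grown)
```

<para>Only the <literal>touched</literal> and <literal>edited</literal>
cases need to read the file; the other two are settled by a single
<literal>stat</literal> call.</para>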
belaran@964: 
belaran@964: </sect2>
belaran@964: </sect1>
belaran@964: </chapter>
belaran@964: 
belaran@964: <!--
belaran@964: local variables: 
belaran@964: sgml-parent-document: ("00book.xml" "book" "chapter")
belaran@964: end:
belaran@964: -->