%
% JFFS3 design issues.
%
% Copyright (C), 2005, Artem B. Bityutskiy,
%
% $Id: intro.tex,v 1.5 2005/11/27 14:37:50 dedekind Exp $
%

The main idea of how to fix JFFS2 and make it scalable is \emph{to move
the index from RAM to flash}. Unfortunately, this requires a complete
JFFS2 redesign and \mbox{re-implementation}, and the design of JFFS3 is
largely different from the design of JFFS2. This section discusses the
base JFFS3 design ideas without going into much detail.

%
% INDEXING PROBLEM
%
\subsection{Indexing problem}

There is a large difference between block devices and flash devices in
how they allow the contents of a sector to be updated. Block devices
support \mbox{so-called} "\emph{\mbox{in-place} updates}", i.e., the
update may be written straight to the sector. Flash devices do not allow
this unless the whole eraseblock has been erased beforehand. Obviously,
it is unacceptable to erase the whole eraseblock each time a sector is
updated. Instead, the \mbox{so-called} "\emph{\mbox{out-of-place}
updates}" technique is usually used. This simply means that no attempt
is made to update a sector \mbox{in-place}; instead, the update is
written to some other sector and the contents of the previous sector are
afterwards regarded as garbage.

This "\mbox{out-of-place} writes" property of flash devices implies that
JFFS3 also has a \mbox{log-structured} design, as in JFFS3 any update is
written \mbox{out-of-place}. Indeed, it seems natural for any flash file
system to have a \mbox{log-structured} design.

It is interesting to note that in \mbox{log-structured} file systems for
block devices (like the one described in~[\ref{ref_LFS}]) not every
update is "\mbox{out-of-place}". There are always some
\mbox{fixed-position} sectors present. These sectors usually refer to
the file system index, allow quick file system mounting, and are updated
\mbox{in-place}.
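The \mbox{out-of-place} update technique described above can be sketched
in C as follows. This is a toy in-RAM model with hypothetical names, not
JFFS3 code: a sector is never rewritten; the new contents go to a free
sector and the old copy is merely marked as garbage.

```c
#include <assert.h>
#include <string.h>

#define SECTORS     8
#define SECTOR_SIZE 16

enum state { FREE, VALID, GARBAGE };

/* A toy model of a flash chip: per-sector state plus contents. */
struct flash {
    enum state state[SECTORS];
    char data[SECTORS][SECTOR_SIZE];
};

/* Find any free sector; -1 if the flash is full. */
static int find_free(const struct flash *f)
{
    for (int i = 0; i < SECTORS; i++)
        if (f->state[i] == FREE)
            return i;
    return -1;
}

/*
 * Out-of-place update: write the new contents to a free sector and
 * mark the old copy as garbage. Returns the new sector number, or
 * -1 if no free sector is left (garbage collection would be needed).
 */
static int update_sector(struct flash *f, int old, const char *contents)
{
    int new = find_free(f);

    if (new < 0)
        return -1;
    memcpy(f->data[new], contents, SECTOR_SIZE);
    f->state[new] = VALID;
    if (old >= 0)
        f->state[old] = GARBAGE;
    return new;
}
```

Note how an update invalidates the old copy without touching it;
reclaiming such garbage sectors is the job of the Garbage Collector
discussed later in this section.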
Flash devices, however, have a limited number of erase cycles for each
eraseblock, and it is impossible to guarantee good \mbox{wear-levelling}
if some eraseblocks are reserved for similar purposes. So, it is
important that \emph{in JFFS3 there are no \mbox{in-place} updates}, as
good \mbox{wear-levelling} is one of the main requirements for JFFS3
(see section~\ref{ref_SectionJFFS3Req}).

The "\mbox{out-of-place} updates" property makes it difficult to
maintain the index on the flash media.
Figure~\ref{ref_FigureIndexProblem} demonstrates why.

%
% Figure with index problem example
%
\begin{figure}[h]
\begin{center}
\begin{htmlonly}
\includegraphics{pics/idxprobl.png}
\end{htmlonly}
%begin{latexonly}
\includegraphics[width=90mm,height=45mm]{pics/idxprobl.pdf}
%end{latexonly}
\end{center}
\caption{JFFS3 indexing problem example.}
\label{ref_FigureIndexProblem}
\end{figure}

Suppose the index is kept and maintained on flash and it consists of 4
parts $A$, $B$, $C$, and $D$ which refer to each other: $A$ refers to
$B$ and $C$, $B$ refers to $D$, and $C$ refers to $D$. This means that
$A$ contains the physical flash addresses of $B$ and $C$, and so on.

Suppose $D$ should be updated. Since it is updated \mbox{out-of-place},
the newer version $D_1$ is written to some other place. But $B$ and $C$
still refer to $D$, not $D_1$, so they ought to be updated as well. And
when they are updated \mbox{out-of-place}, $A$ will still refer to the
old $B$ and $C$, and so on. Thus, it is not trivial to store and
maintain indexing information on the flash media.

%
% WANDERING TREES
%
\subsection{Wandering trees} \label{ref_SectionWandTrees}

To address the above problem it is possible to use \emph{wandering
trees}. Figure~\ref{ref_FigureWandtree} demonstrates how wandering trees
work.
%
% Figure with the wandering tree example
%
\begin{figure}[h]
\begin{center}
\begin{htmlonly}
\includegraphics{pics/wandtree.png}
\end{htmlonly}
%begin{latexonly}
\includegraphics[width=159mm,height=80mm]{pics/wandtree.pdf}
%end{latexonly}
\end{center}
\caption{Wandering tree example.}
\label{ref_FigureWandtree}
\end{figure}

\begin{enumerate}
\item Suppose that the index is a tree and it is stored and maintained
on the flash media. The tree consists of nodes $A$, $B$, $C$, $D$, $E$,
$F$, $G$, and $H$. Suppose node $H$ should be updated.
\item At first, the updated version $H_1$ is written. Obviously, $F$
still refers to $H$.
\item Now the corresponding link in node $F$ is changed and node $F_1$
is written to flash. $F_1$ refers to $H_1$. But as $F_1$ is also written
\mbox{out-of-place}, $A$ still refers to the old node $F$.
\item Finally, the new root node $A_1$ is written and it refers to
$F_1$.
\item Nodes $A$, $F$, and $H$ are now treated as garbage and the updated
tree is composed of nodes $A_1$, $B$, $C$, $D$, $E$, $F_1$, $G$, and
$H_1$.
\end{enumerate}

So, wandering trees are the basic idea of how the indexing information
is going to be maintained on the flash media in JFFS3. And it stands to
reason that any tree may be called a "wandering tree" if any update in
the tree requires updating the parent nodes up to the root. For example,
it makes sense to talk about wandering \mbox{Red-Black} trees or
wandering \mbox{$B^+$-trees} and so forth.

%
% B+ TREES
%
\subsection{B-trees} \label{ref_SectionBTrees}

JFFS3 uses \mbox{$B^+$-trees} and this subsection gives a short
introduction to \mbox{$B^+$-trees}. There are plenty of books where one
may find more information. There are also many \mbox{on-line} resources
available, e.g.,~[\ref{ref_BTrees}].

Informally, a \mbox{$B$-tree} may be defined as a balanced search tree
where each node may have many children. The \emph{branching factor} or
the \emph{fanout} defines the maximal number of children a node may
have.
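To give a feeling for what the branching factor means in practice, the
following sketch (a hypothetical helper, not JFFS3 code) computes the
maximal number of leaf nodes a tree with a given fanout and a given
number of levels may index:

```c
#include <assert.h>
#include <stdint.h>

/*
 * A tree with branching factor "fanout" and "levels" levels can index
 * up to fanout^(levels - 1) leaf nodes, so the tree height grows only
 * logarithmically with the amount of indexed data.
 */
static uint64_t max_leaves(uint64_t fanout, unsigned levels)
{
    uint64_t leaves = 1;

    while (levels-- > 1)
        leaves *= fanout;   /* each extra level multiplies capacity */
    return leaves;
}
```

For example, with a fanout of 64, four levels are enough to index
$64^3 = 262144$ leaf nodes.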
While \mbox{$B$-trees} may contain useful data along with \emph{keys}
and \emph{links} in \mbox{non-leaf} nodes, \mbox{$B^+$-trees} are
\mbox{$B$-trees} which store data only in leaf nodes, while
\mbox{non-leaf} nodes contain only keys and links.

%
% The figure with B+-tree example
%
\begin{figure}[h]
\begin{center}
\begin{htmlonly}
\includegraphics{pics/btree-01.png}
\end{htmlonly}
%begin{latexonly}
\includegraphics[width=159mm,height=65mm]{pics/btree-01.pdf}
%end{latexonly}
\end{center}
\caption{$B^+$-tree example.}
\label{ref_FigureBTree_01}
\end{figure}

Figure~\ref{ref_FigureBTree_01} demonstrates a \mbox{$B^+$-tree} with
branching factor $n$ and number of levels $L$. Note that in JFFS3 levels
are numbered starting from the \emph{leaf} nodes (level 0) and ending at
the \emph{root} node (level $L-1$).

Leaf nodes in the \mbox{$B^+$-tree} contain data which is indexed by
keys. \mbox{Non-leaf} nodes do not contain data; they contain only the
indexing information, namely, \emph{keys} and \emph{links}.

%
% Figure with B+-tree non-leaf node
%
\begin{figure}[h]
\begin{center}
\begin{htmlonly}
\includegraphics{pics/node-01.png}
\end{htmlonly}
%begin{latexonly}
\includegraphics[width=160mm,height=60mm]{pics/node-01.pdf}
%end{latexonly}
\end{center}
\caption{The structure of a non-leaf node in $B^+$-tree.}
\label{ref_FigureNode_01}
\end{figure}

Figure~\ref{ref_FigureNode_01} depicts the structure of a
\mbox{non-leaf} node. There are $n$ links and $n-1$ keys in the node.
Links may point either to leaf nodes or to other \mbox{non-leaf} nodes.
In the former case, the leaf node contains data which corresponds to the
key which follows the link. In the latter case, the pointed
\mbox{non-leaf} node (and the whole subtree rooted at this
\mbox{non-leaf} node) contains keys in the range $(Key~1, Key~2]$. Keys
are sorted in ascending order in \mbox{non-leaf} nodes, so it is not
difficult to look up the data corresponding to any key.
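The lookup step inside one \mbox{non-leaf} node can be sketched as
follows. This is a toy in-RAM model with hypothetical names (the real
on-flash node format is different): given the sorted keys, it picks the
link whose subtree covers the target key, using the $(Key~1, Key~2]$
interval convention described above.

```c
#include <assert.h>
#include <stdint.h>

#define FANOUT 4   /* small branching factor, for illustration only */

/*
 * A sketch of a B+-tree non-leaf node: n links and n-1 sorted keys
 * (a toy in-RAM model, not the JFFS3 on-flash format).
 */
struct idx_node {
    unsigned nkeys;              /* number of keys in use (<= FANOUT-1) */
    uint64_t key[FANOUT - 1];    /* sorted separator keys               */
    uint64_t link[FANOUT];       /* addresses of the child nodes        */
};

/*
 * Pick the link to follow: link i covers keys in (key[i-1], key[i]],
 * so the answer is the number of keys strictly less than the target.
 */
static unsigned pick_link(const struct idx_node *node, uint64_t target)
{
    unsigned i = 0;

    while (i < node->nkeys && node->key[i] < target)
        i++;
    return i;
}
```

A full tree lookup simply repeats this step from the root down, which is
why the number of steps is bounded by the number of levels.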
Furthermore, the tree is balanced, so the number of lookup steps does
not depend on the key.

When objects are inserted into or removed from the tree,
\mbox{re-balancing} may be needed. The tree is \mbox{re-balanced} by
means of splitting nodes or merging them, and a simple enough algorithm
exists. Please refer to Donald Knuth's books for more information about
\mbox{re-balancing} \mbox{$B^+$-trees}.

\mbox{$B^+$-trees} are widely used when working with block devices
(e.g., hard drives). Indeed, these devices have a fixed input/output
unit size (usually referred to as a \emph{sector}) and it is natural to
use \mbox{$B^+$-trees} with a node size that is a multiple of the sector
size in order to store information on such devices.

%
% INDEXING IN JFFS3
%
\subsection{Indexing in JFFS3} \label{ref_SectionIndexing}

The way JFFS3 stores and indexes the file system is similar to the
approach used by the \emph{Reiser4} file system
(see~[\ref{ref_Reiser4}]). All the file system objects (inodes, files,
directory entries, extended attributes, etc.) are kept in one large
\mbox{$B^+$-tree}. Effectively, the whole JFFS3 file system may be
regarded as one large \mbox{$B^+$-tree}. This tree is further referred
to simply as "\emph{the tree}".

Every object which is stored in the tree has a \emph{key}, and objects
are found in the tree by their keys. To clarify what object keys are,
the following is an example of how they may look:

\begin{itemize}
\item file data key: \{\texttt{inode number}, \texttt{offset}\};
\item directory entry key: \{\texttt{parent directory inode number},
\texttt{direntry name hash}\} and the like.
\end{itemize}

The following terms are used in JFFS3 to refer to nodes of different
levels in the tree:

\begin{itemize}
\item nodes of level 0 are \emph{leaf nodes};
\item nodes of level 1 are \emph{twig nodes};
\item nodes which are not the root, not leaf, and not twig are
\emph{branch nodes};
\item \mbox{non-leaf} nodes (i.e., the root, branch and twig nodes) are
\emph{indexing nodes}.
\end{itemize}

Note that the same terminology (except for indexing nodes) is used in
the Reiser4 file system~[\ref{ref_Reiser4}].

\mbox{Non-leaf} nodes are called "indexing nodes" because they contain
only indexing information, nothing else. No file system data is kept in
the indexing nodes. Indexing nodes have a fixed size which is equal to
the flash \emph{sector} size.

It is important to note that somewhat unusual terminology is used in
this document. The smallest input/output unit of the flash chip is
called a \emph{sector}. Since JFFS3 is mainly oriented towards NAND
flashes, the sector is usually the NAND page and is either 512 bytes or
2 kilobytes in size. For other flash types the sector may be different.
If the flash's minimal input/output unit is very small (say, one bit, as
in the case of NOR flash), there should be a layer which emulates larger
sectors (say, 512 bytes).

In contrast to indexing nodes, leaf nodes have a flexible size, just
like nodes in JFFS2. So, roughly speaking, the JFFS3 file system may be
considered as the JFFS2 file system (leaf nodes) plus indexing
information (indexing nodes) (see figure~\ref{ref_FigureBTree_02}).

%
% Figure the JFFS3 tree example
%
\begin{figure}[h]
\begin{center}
\begin{htmlonly}
\includegraphics{pics/btree-02.png}
\end{htmlonly}
%begin{latexonly}
\includegraphics[width=159mm,height=65mm]{pics/btree-02.pdf}
%end{latexonly}
\end{center}
\caption{The JFFS3 tree.}
\label{ref_FigureBTree_02}
\end{figure}

Similarly to JFFS2, leaf nodes consist of a \emph{header} and
\emph{data}.
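As a rough illustration of the header/data split and of the key formats
mentioned above, a leaf node might be modelled like this. The field
names and layout are hypothetical; the actual JFFS3 media format is
defined elsewhere in this document.

```c
#include <assert.h>
#include <stdint.h>

/*
 * A hypothetical key for file data objects, following the
 * {inode number, offset} example from the text.
 */
struct leaf_key {
    uint32_t inode;     /* inode number the data belongs to   */
    uint32_t offset;    /* offset of the data within the file */
};

/*
 * A hypothetical leaf-node header (illustration only): the key
 * identifies the data that follows the header.
 */
struct leaf_header {
    struct leaf_key key;   /* what this node contains         */
    uint32_t len;          /* length of the data that follows */
    uint32_t crc;          /* checksum over header and data   */
};
/* The (possibly compressed) node data itself follows the header. */

/* Keys sort by inode number first, then by offset. */
static int key_cmp(const struct leaf_key *a, const struct leaf_key *b)
{
    if (a->inode != b->inode)
        return a->inode < b->inode ? -1 : 1;
    if (a->offset != b->offset)
        return a->offset < b->offset ? -1 : 1;
    return 0;
}
```

A total ordering like \texttt{key\_cmp} is what allows all objects to
live together in one large \mbox{$B^+$-tree}.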
The header describes the node data and contains information like the key
of the node, its length, and the like. The node data contains some file
system data, for example a directory entry, file contents, etc.

Leaf and indexing nodes are physically separated, which means that there
are eraseblocks with only indexing nodes and eraseblocks with only leaf
nodes. But of course, this does not mean that the whole flash partition
is divided into two parts; it only means that indexing and leaf nodes
are not mixed within one eraseblock. Figure~\ref{ref_FigureFlash_01}
illustrates this.

%
% Figure with demonstration leaf and indexing nodes separation.
%
\begin{figure}[h]
\begin{center}
\begin{htmlonly}
\includegraphics{pics/flash-01.png}
\end{htmlonly}
%begin{latexonly}
\includegraphics[width=159mm,height=25mm]{pics/flash-01.pdf}
%end{latexonly}
\end{center}
\caption{An illustration of leaf and indexing nodes separation.}
\label{ref_FigureFlash_01}
\end{figure}

Eraseblocks which contain only indexing nodes are called \emph{indexing
eraseblocks} and those with leaf nodes are called \emph{leaf
eraseblocks}.

The depth of the tree depends on how many objects are kept in the file
system. The more files, directories, etc.\ are present in the file
system, the deeper the tree. Fortunately, the number of tree levels
grows very slowly with the growing number of file system objects, and
the tree lookup scales as $O(\log_n{S})$ (logarithmically).

The following are advantages of the JFFS3 indexing approach.

\begin{itemize}
\item Many different key assignment schemes may be used, and this gives
flexibility in how objects are sorted in the tree. Thus, one may
optimize JFFS3 for specific workloads by means of changing the format of
the keys.
\item Leaf nodes may be compressed, so JFFS3 supports
\mbox{on-the-fly} compression.
\item In case of corruption of the indexing information, it is possible
to \mbox{re-create} it by means of scanning the leaf nodes' headers.
\item There is a clear separation between data and indexing information.
This implies that the indexing information and data may be cached
separately, without overlapping in the same cache lines. This leads to
better cache usage, as described in the Reiser4
paper~[\ref{ref_Reiser4}].
\end{itemize}

%
% INDEXING EXAMPLE
%
\subsection{Indexing example}

This section illustrates how JFFS3 indexing works by means of a simple
example. The example is very rough, but it shows JFFS3 indexing in
action. It is assumed that keys of direntries and data objects have the
layout mentioned in section~\ref{ref_SectionIndexing}.

Suppose that the user does the following:

\begin{enumerate}
\item mounts a JFFS3 file system to the "\texttt{/mnt/jffs3}" directory;
\item issues the "\texttt{ls /mnt/jffs3}" command;
\item reads the contents of the "\texttt{/mnt/jffs3/my\_file}" file.
\end{enumerate}

The following are comments about what is going on in JFFS3 during the
above steps.

\begin{enumerate}
\item During mount, JFFS3 locates the position of the root node. This is
done with the help of the JFFS3 \emph{superblock}, which will be
described later (see section~\ref{ref_SectionSB}).
\item To get the list of directory entries in the root directory, JFFS3
looks~up all objects matching the \{\texttt{2},~$*$\} key pattern.
Indeed, direntry keys have the
\{\texttt{parent~inode~\#},~\texttt{name~hash}\} format, the root
directory inode number is 2 (or another predefined constant), and "$*$"
means a wildcard. Thus, \{\texttt{2},~$*$\} will match any direntry in
the root directory.
\item To read the "\texttt{my\_file}" file, JFFS3 first needs to find
out its inode number. The inode number is stored in the directory entry
object. Hence, JFFS3 reads \texttt{my\_file}'s direntry using the
\{\texttt{2},~$H($\texttt{"my\_file"}$)$\} key ($H()$ is the hash
function). Then JFFS3 searches for \texttt{my\_file}'s data objects
using \{$I$,~\texttt{offset}\} keys, where $I$ is \texttt{my\_file}'s
inode number.
Depending on which part of the file should be read, \texttt{offset} may
take different values.
\end{enumerate}

The above description is somewhat simplified, e.g., JFFS3 also needs to
read \texttt{my\_file}'s attr-data object to fetch the inode length from
there (see section~\ref{ref_SectionObjects}), etc. But the aim of this
section is just to provide an idea of how JFFS3 indexing works.

%
% THE JOURNAL
%
\subsection{The Journal} \label{ref_SectionJournalIntro}

The JFFS3 tree is both a \mbox{$B^+$-tree} and a wandering tree. Any
file system change implies that a new node is written to the flash
media, which in turn means that a number of indexing nodes must be
updated. Namely, the whole path of indexing nodes up to the root node
should be updated (see section~\ref{ref_SectionWandTrees}). Evidently,
it is very expensive to update several indexing nodes on each file
system change, and \emph{the journal} provides a mechanism to avoid
this.

The journal consists of a set of eraseblocks (the \emph{journal
eraseblocks}) which do not have a fixed location on flash and are not
contiguous on flash. Any flash eraseblock may be used as a journal
eraseblock.

%
% Figure with journal illustration
%
\begin{figure}[h]
\begin{center}
\begin{htmlonly}
\includegraphics{pics/journal-01.png}
\end{htmlonly}
%begin{latexonly}
\includegraphics[width=159mm,height=60mm]{pics/journal-01.pdf}
%end{latexonly}
\end{center}
\caption{The JFFS3 journal.}
\label{ref_FigureJournal_01}
\end{figure}

When something is changed in the JFFS3 file system, the corresponding
leaf node is written to the journal, but the corresponding indexing
nodes are not updated. Instead, JFFS3 keeps track of file system changes
in RAM, in a data structure called \emph{the journal tree} (see
figure~\ref{ref_FigureJournal_01}).

When something is read from the file system, JFFS3 first glances at the
\mbox{in-RAM} journal tree to figure out if the needed data is in the
journal.
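The lookup order just described can be sketched as follows. This is a
toy model with hypothetical helpers, not JFFS3 code: data found via the
\mbox{in-RAM} journal tree shadows older data reachable through the main
tree.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* A toy lookup table standing in for a tree. */
struct entry {
    uint64_t key;
    const char *data;
};

static const char *lookup(const struct entry *tab, size_t n,
                          uint64_t key)
{
    for (size_t i = 0; i < n; i++)
        if (tab[i].key == key)
            return tab[i].data;
    return NULL;    /* miss */
}

/*
 * Read request processing: consult the journal tree first; fall back
 * to the main tree only if the journal does not have the data.
 */
static const char *fs_read(const struct entry *journal, size_t jn,
                           const struct entry *tree, size_t tn,
                           uint64_t key)
{
    const char *data = lookup(journal, jn, key);

    return data ? data : lookup(tree, tn, key);
}
```

The journal-first order is what makes it safe to leave the on-flash
indexing nodes stale until the next commit.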
If the data is there, the journal is read; otherwise, JFFS3 performs the
usual tree lookup (see figure~\ref{ref_FigureJournal_02}).

%
% Figure with write request
%
\begin{figure}[h]
\begin{center}
\begin{htmlonly}
\includegraphics{pics/journal-02.png}
\end{htmlonly}
%begin{latexonly}
\includegraphics[width=120mm,height=55mm]{pics/journal-02.pdf}
%end{latexonly}
\end{center}
\caption{Read request processing in JFFS3.}
\label{ref_FigureJournal_02}
\end{figure}

The journal is \emph{committed} when it is full, or at some other time
appropriate for JFFS3. This means that the indexing nodes corresponding
to the journal changes are updated and written to the flash. The
committed journal eraseblocks are then treated as leaf eraseblocks, and
new journal eraseblocks are picked by JFFS3 using the common JFFS3
\mbox{wear-levelling} algorithm.

The journal makes it possible to postpone indexing information updates
to a later and potentially more appropriate time. It also allows many
indexing node updates to be merged, which lessens the amount of flash
write operations.

When the JFFS3 file system is being mounted, the journal should be read,
"replayed", and the journal tree should be built. So, the larger the
journal is, the longer it may take to mount JFFS3. On the other hand,
the larger the journal is, the more writes may be deferred, and the
better the achievable performance. In other words, there is a
\mbox{trade-off} between the mount time and the performance, and one may
vary these characteristics by means of changing the size of the journal.

%
% GARBAGE COLLECTION
%
\subsection{Garbage collection}

Garbage collection is a vital part of any \mbox{log-structured} file
system. Over time, JFFS3 uses up all the flash free space and it needs
to reclaim the flash space occupied by garbage. The goal of the Garbage
Collector is to recycle garbage and reclaim the flash space which it
occupies.
Since the only way to reclaim flash space is to erase the whole
eraseblock, the Garbage Collector works in terms of eraseblocks. The
JFFS2 Garbage Collector is quite simple and works in several steps.

\begin{enumerate}
\item To reclaim dirt from an eraseblock, JFFS2 moves all valid nodes
from this eraseblock to another eraseblock.
\item As the nodes have changed their positions, the JFFS2
\mbox{in-RAM} index is adjusted.
\item The first eraseblock may then be erased and \mbox{re-used}.
\end{enumerate}

Note that JFFS2 (and JFFS3) always reserve several eraseblocks in order
to guarantee that there are always some free eraseblocks available to
perform garbage collection.

The JFFS3 Garbage Collector is more complex. When valid data has been
moved from an eraseblock, the corresponding indexing nodes must be
updated as well. Depending on how much space the Garbage Collector has
reclaimed and how much space it has spent updating indexing nodes, it
might be that the Garbage Collector produces more garbage than it
reclaims. This problem is demonstrated in
figure~\ref{ref_FigureGCProbl_01}.

%
% Figure with the garbage collection problem
%
\begin{figure}[h]
\begin{center}
\begin{htmlonly}
\includegraphics{pics/gcprobl-01.png}
\end{htmlonly}
%begin{latexonly}
\includegraphics[width=120mm,height=96mm]{pics/gcprobl-01.pdf}
%end{latexonly}
\end{center}
\caption{JFFS3 garbage collection problem illustration.}
\label{ref_FigureGCProbl_01}
\end{figure}

The figure depicts a subtree rooted at node $A$. At the beginning
(snapshot~1), leaf nodes $D$, $E$, $F$, and $G$ are situated in
eraseblock number~1. Indexing nodes $A$, $B$, and $C$ are situated in
eraseblock number~2. There are also two reserved eraseblocks. Suppose
all nodes have the same size, equivalent to the size of a sector, which
is 512 bytes in this example.

At snapshot~2, JFFS3 has decided to reclaim 512 bytes of dirty space
from eraseblock number~1.
The Garbage Collector moves all the valid nodes from eraseblock number~1
to one of the reserved eraseblocks. But as indexing nodes $B$ and $C$
still refer to the old copies of the moved nodes in eraseblock number~1,
this eraseblock cannot be erased yet. Indexing nodes $A$, $B$, and $C$
have to be updated first.

At snapshot~3, the Garbage Collector has updated indexing nodes $A$,
$B$, and $C$, putting them into one of the reserved eraseblocks. From
now on, the old copies of nodes $A$, $B$, $C$, $D$, $E$, $F$, and $G$ at
eraseblocks~1 and~2 comprise garbage. Eraseblock number~1 has been
erased and is now free.

But unfortunately, the result is that the Garbage Collector has produced
more garbage than the space it has reclaimed. Indeed, it reclaimed 512
bytes while producing three times as much garbage (see the first three
sectors at eraseblock~2, snapshot~3). Compare snapshots~1 and~2.

Hence, it is obvious that garbage collection in JFFS3 must be more
complex than in JFFS2. Chapter~\ref{ref_SectionGC} discusses the JFFS3
Garbage Collector in detail.

%
% THE SUPERBLOCK
%
\subsection{The superblock}

The JFFS3 \emph{superblock} is a data structure that describes the file
system as a whole and contains important information like the offset of
the root node, the journal eraseblocks, etc. When the file system is
being mounted, JFFS3 first finds and reads the superblock.

In traditional file systems the superblock usually resides at a fixed
position on the disk and may be found very quickly. Conversely, due to
the "\mbox{out-of-place} write" flash property, it is impossible to
assign a fixed position to the JFFS3 superblock. Things get even more
complex because of the need to provide good \mbox{wear-levelling}~-- it
is incorrect to just reserve several eraseblocks for the superblock
unless it is guaranteed that these eraseblocks will not be worn out
earlier than the other eraseblocks.
We have the following two requirements that ought to be met in JFFS3:

\begin{itemize}
\item JFFS3 must be able to quickly find the superblock;
\item the superblock management techniques must not spoil the overall
flash \mbox{wear-levelling}.
\end{itemize}

In classical file systems the superblock usually contains a lot of
static data which is rarely updated, and the superblock may have any
size. In JFFS3, the superblock must be updated quite often (e.g., each
time the journal is committed). This means that, to lessen the amount of
input/output, the JFFS3 superblock should be as small as possible,
namely, one sector. And there is no reason to keep any static data in
the superblock (e.g., the size of the file system, its version, etc.).
For static data, JFFS3 reserves the first eraseblock of the JFFS3
partition.

Thus, the following terms are used in this document:

\begin{itemize}
\item \emph{static superblock}~-- contains only static data which are
never changed by JFFS3; the static superblock resides in the
\emph{static eraseblock}; the static eraseblock is the first
\mbox{non-bad} eraseblock of the JFFS3 partition; it is supposed that
the contents of the static eraseblock may only be changed by external
\mbox{user-level} tools;
\item \emph{superblock}~-- contains only dynamic data, is changed quite
often, and requires special management methods.
\end{itemize}

JFFS3 has a rather complicated superblock management scheme which makes
it possible to quickly find the superblock, without full flash scanning,
when the file system is being mounted. This scheme provides good flash
\mbox{wear-levelling}. The superblock lookup should take a few
milliseconds and scale as $O(\log_2{S})$. For more detailed information
about the superblock management scheme see
section~\ref{ref_SectionSBAlg}.
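The actual superblock management scheme is described in
section~\ref{ref_SectionSBAlg}; the sketch below is not that algorithm,
it merely illustrates why a lookup that halves its search range on every
probe needs only about $\log_2{S}$ steps for $S$ eraseblocks:

```c
#include <assert.h>

/*
 * Number of probes a range-halving search needs to localize one
 * eraseblock among "s" candidates: about log2(s). This is a generic
 * illustration, not the JFFS3 superblock algorithm itself.
 */
static unsigned probes_needed(unsigned long s)
{
    unsigned probes = 0;

    while (s > 1) {
        s = (s + 1) / 2;    /* each probe halves the search range */
        probes++;
    }
    return probes;
}
```

For example, a partition with 16384 eraseblocks would need only 14 such
probes, which is why the lookup stays in the millisecond range even for
large flash chips.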