Logical Disk Layout

Logical Disk Layout:

We have logical disks, which can hold either data, checksums, or be idle. Each logical disk is identical in capacity to the others. The figure below shows m+e+f logical disks. We assume that each logical disk has a logical block size. A logical stripe consists of a single block from each logical disk, such that the block number of this block is the same across each disk.(The figure below shows N+1 stripes starting with stripe 0 at the bottom). The logical stripe number is equal to this unique block number. Thus logical stripe 77 has logical block number 77 from each of the data disks, checksum disks, and idle disks. For simplicity, we make logical disk block numbers equal to phyiscal disk block numbers. Here N is the number of blocks on a logical disk, which is equal to the number of block on a physical disk.

Block 

Stripe 

N-1

N-1

N-1

N-1

N-1

N-1

N-1

N-1

N-1

N-1

N-1 

...  ...  ...  ...  ...  ...  ...  ...  ... 

... 

... 
...  ...  ...  ...  ...  ...  ...  ...  ... 

... 

... 
1 1 1 1 1 1 1 1 1

1

0 0 0 0 0 0 0 0 0

0

LD0 

LD1

... 

LDm-1

LC0

...

LCe-1 

LS0 

... 

LSf-1 

Logical
Drive 

LDi is the i-th logical data disk, i.e., a logical disk that stores only user data. There are 'm' such disks, numbered 0 through m-1. LCi is the i-th check or correction disk, i.e., a logical disk that stores the i-th erasure code for data in logical disks LD0 through LDm-1. For about 10,000 total data disks: 40 pods of 250 data disks each, 3 correction disks per pod gives us an MTTF of 50,000 years for the 10,000 disks.

LSi is the i-th spare disk. The blocks on the spare disks are used to store the data and correction blocks from failed disks. When a spare block is used to store data from a failed disk, we call the spare block an "active spare", otherwise the spare block is called "idle spare". Spare disks in a particular stripe always holds data from the same stripe.

Physical Disk Role Assignment:

We refer to "data", "correction", and "spare" as roles assigned blocks of physical disks. We would typically like to assign a group of consecutive logical stripes the same casting. We call such a group of stripes a troupe.

The number of stripes in a troupe is called the troupe size, TS.

The figure below shows m+e+f physical disks, named Pi roles assigned to the blocks of a troupe, assuming that roles are rotated left on every change of troupe. The assignment of a role to a block is called a casting. The troupe number of block b (either physical or logical) is b div TS.

If the troupe size is too small, during reconstruction, many discontiguous blocks will have to be read in order to perform computations using the same inversion matrix. Alternatively, a large contiguous read will require multiple inversion matrices. On the other, too large a troupe size, will cause roles to change infrequently within a single disk and therefore cause imbalanced utilization of resources.

This suggests TS to be just large enough to get full disk bandwidth when reading a troupe for reconstruction. If we set TS to 1000, each troupe occupies 1/2 MB on each disk, and a disk contains roughly 200,000 troupes.

               
               
               
               
               
               
               
               
Troupe Stripe 

LD0 

LD1 

... 

LDm-1 

LC0 

... 

LCe-1 

LS0 

... 

LSf-1 

2m+2e+2f  ... 
 LSf-1  LD0  

...

 LDm-2  LDm-1  ...  LCe-2  LCe-1  ...  LSf-2 2m+2e+2f-1  ... 
...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ... 
 LS0  LS1  

...

 LDm-f-1  

LDm-f

 ...  LDm+e-f-1  

LDm+e-f

...   LCe-1 ... 
 LCe-1  

LS0

 ... LDm-f-2 LDm-f-1 ...  LDm+e-f-2  LDm+e-f-1  

...

LCe-2 2m+2e+f-1  ...

...

 

...

 

...

 

...

 

...

 

...

 

...

 

...

 

...

 

...

 

...

 

... 

 LC0  LC1  ...  LDm-e-f-1  LDm-e-f  

...

 LDm-f-1  

LDm-f

 

...

 

LDm-1

 2m+e+f ... 
 

LDm-1

 

LC0

 ...  

LDm-e-f-2

 

LDm-e-f-1

 ...  

LDm-f-2

 LDm-f-1  ...  

LDm-2

 2m+e+f-1  

...

... 

...

 

...

...

...

...

...

...

...

...

...

... 

 

LD2

 

LD3

 

...

 

LC1

 LC2  

...

 

LS1

 

LS2

 

...

 

LD1

 m+e+f+2  

...

 LD1  LD2  

...

 

LC0

 LC1  

...

 

LS0

 

LS1

 

...

 LD0  m+e+f+1  

...

 

LD0

 

LD1

 

...

 

LDm-1

 

LC0

 

...

 

LCe-1

 

LS0

 

...

 

LSf-1

 m+e+f  ...
 LSf-1  

LD0

 

...

 

LDm-2

 LDm-1  

...

 LCe-2  LCe-1  ...  

LSf-2

 m+e+f-1  ...

...

 

...

 

...

 

...

 

...

 

...

 

...

 

...

 

...

 

...

 

...

 

...

 
LS0 LS1  ... LDm-f-1  LDm-f

...

 LDm+e-f-1  LDm+e-f  ...  LCe-1 m+e

...

 LCe-1  

LS0

 ...  LDm-f-2  

LDm-f-1

 

...

 LDm+e-f-2  LDm+e-f-1  ...  LCe-2  m+e-1  ...
...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ... 
 

LC0

 

LC1

 

...

 

LDm-e-f-1

 

LDm-e-f

 

...

 LDm-f-1  LDm-f  

...

 

LDm-1

 m  

...

 LDm-1  

LC0

 

...

 LDm-e-f-2  

LDm-e-f-1

 ...  

LDm-f-2

 

LDm-f-1

 

...

 LDm-2  m-1 ... 
...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ... 
 LD2  LD3  

...

 

LC1

 LC2  ...  LS1  LS2  ...   

LD1

 2 2TS – 3TS-1
 LD1  LD2  

...

 

LC0

 LC1  ...  

LS0

 LS1  

...

 LD0  1  TS – 2TS-1
 

LD0

 LD1  ...  LDm-1  

LC0

 ...  LCe-1  

LS0

...

 
 LSf-1  0  0 – TS-1
 P0  P1  ...

Pm-1

 Pm

...

Pm+e-1

 Pm+e

...

Pm+e+f-1

Physical drive

 

 

 

 

 

Mapping User's Logical Blocks to Logical Disks:

User's block number x is mapped to block number (x div m) on logical disk (x mod m). Here x goes from 0 to m*N - 1.

Mapping Logical Disks and Block numbers to Physical Disks and Block Numbers:

A logical disk and block pair is of the form , where b is in [0,N) and d is in [0, m + e + f). Note that d in the range [0, m) refers to logical *data* disks LDd, whereas, d in the range [m, m+e) refers to logical *correction* disks LCd-m, and d in the range [m+e, m+e+f) refers to logical *spare* disks LSd-m-e.

A logical disk block pair of the form maps to physical block b and physical disk (d - (b div TS)) mod (m + e + f), where TS is the troupe size.

The chart above shows assignments, assuming that e+f << m, and that e << f.

Reconstruction after physical disk failure:

When a physical disk dies, we lose the data stored on that disk in the role (data, correction, or spare) assigned to it in all troupes. For example, in Figure 2 above, if Pm-1 dies, we lose correction blocks from troupes 1, 2, ..., e and repeating every m+e+f troupes thereafter, data blocks from troupes 0, e+f+1, e+f+2, ..., e+f+m, and repeating every m+e+f troupes thereafter, and spare blocks from troupe e+1, e+2, ..., e+f, and repeating every m+e+f troupes thereafter. Note that losing idle spare blocks does not cause us to lose data but reduces the capacity of spare blocks. Losing active spare blocks on the other hand requires us to find idle spare blocks elsewhere to move the data to.

Reconstruction is done one troupe at a time. Fix troupe T, then pick an LSi, such that

  1. the physical disk to which LSi is assigned in troupe T is alive.
  2. i is minimized
  3. LSi is an idle spare in troupe T.

To see how these steps work in practice, work through an example, assume Pm-1 were the very first disk to die. For troupe 0 data, we would choose LS0 residing on Pm+e, which would be then become an active spare playing the role of LDm-1. In troupe 1, we would pick LS0, residing on physical disk Pm+e-1, to become an active spare playing the role of LC0. In troupe e, since the role cast Pm-1 was LS0, and is therefore idle spare, we don't need to do anything. In general, a failed disk will have active spare blocks which contain data or correction values. For the purspose of reconstruction, these active spare blocks are treated as though they were data blocks or correction blocks, and recovered according to the following algorithm.

Having picked the destination for the recovered data for troupe T, we can now compute the recovered data. We are going to recover the data for all stripes in the troupe T. We first locate the appropriate physical disks for troupe T so that we can read a total of m data or correction blocks for each stripe in T. The blocks are referred as the recovery sources. As a practical consideration, we prefer to read data blocks instead of a correction block, and correction blocks in LC0 over other correction blocks.

We need to determine the roles (data or correction) of the m blocks read in the previous paragraph. Given these roles and the roles of the blocks to be recovered, we can construct vectors of coefficients, one for each block to be recovered. The calculation of the coefficients vary depending on the roles of the available blocks and the roles of the blocks to be recovered. (The computation of the vectors is described in a later paragraph.) We do a dot product of the available blocks and the vectors sending the result to the appropriate destination disk. Once the reconstruction is finished for all stripes in this troupe

Computing the vectors:

We first get the encoding matrix that produces the m recovery sources from the m data disks, which may not be presently available. Notice that this matrix is a full rank minor of the complete encoding matrix. We call the complete encoding matrix E. We refer to the full rank minor matrix S. We take the inverse of the S matrix to produce R, the recovery matrix.

Note that E, S, and R are mostly permutations of the identity matrix, so they can be efficiently represented and inverted.

[s0 s1 ... sm-1]R = [d0 d1 ... dm-1]

[d0 d1 ... dm-1]E = [d0 d1 ... dm-1 c0 c1 ... ce-1]

So,

[s0 s1...sm-1] R*E = [d0 d1 ... dm-1 c0 c1 ... ce-1]

Notice that the columns of R*E are the unit vectors except where corresponding element is missing in the surviving sources. Since R*E recovers all the data and correction values, it may of recover more data or correction values than were originally missing. We can therefore discard the columns of R*E corresponding to all surviving blocks.

As a special case, if only correction disks are missing, S and therefore R are exactly the identity matrix.

As a second special case, if only a single data disk is lost, we will have a check disk whose encoding column is all 1s (i.e., the parity disk) and the recovery procedure is a simple XOR and does not require a matrix inversion or element multiplication. This is identical to the case of RAID5.