Reduced complexity hard- and soft-input BCH decoding with applications in concatenated codes

Error correction coding for optical communication and storage requires high rate codes that enable high data throughput and low residual errors. Recently, different concatenated coding schemes were proposed that are based on binary BCH codes with low error correcting capabilities. In this work, low-complexity hard- and soft-input decoding methods for such codes are investigated. We propose three concepts to reduce the complexity of the decoder. For the algebraic decoding we demonstrate that Peterson's algorithm can be more efficient than the Berlekamp–Massey algorithm for single, double, and triple error correcting BCH codes. We propose an inversion-less version of Peterson's algorithm and a corresponding decoding architecture. Furthermore, we propose a decoding approach that combines algebraic hard-input decoding with soft-input bit-flipping decoding. An acceptance criterion is utilised to determine the reliability of the estimated codewords. For many received codewords the stopping criterion indicates that the hard-decoding result is sufficiently reliable, and the costly soft-input decoding can be omitted. To reduce the memory size for the soft-values, we propose a bit-flipping decoder that stores only the positions and soft values of a small number of code symbols. This method significantly reduces the memory requirements and has little adverse effect on the decoding performance.


| INTRODUCTION
Concatenated codes using BCH codes of moderate length with low error correcting capability have recently been applied for error correction in optical communication as well as in storage systems. Such coding systems require very high throughput with hard-input decoding, but some applications also demand efficient soft-input decoding algorithms. These requirements can be met by product codes, half-product codes, staircase codes, or generalized concatenated codes. For instance, product code constructions based on BCH codes were proposed in [1][2][3]. Moreover, generalized concatenated codes (GCC) with inner BCH codes were investigated in [4][5][6][7][8].
Due to the required code rates, BCH codes that can only correct single, double, or triple errors are used. The decoding of the concatenated codes typically uses multiple rounds of BCH decoding. Hence, the achievable throughput depends strongly on the speed of the BCH decoder. Moreover, BCH codes that correct only two or three errors are used in random access memory (RAM) applications [16][17][18], which need high data throughput and a very low decoding latency. In this work, we propose efficient methods for hard- and soft-input decoding for single, double, or triple error correcting BCH codes. We propose three concepts to reduce the complexity of the decoder hardware.
First, we consider the algebraic hard-input decoding. Algebraic BCH decoding consists of three steps: syndrome calculation, calculation of the error location polynomial, and the Chien search which determines the error positions. For BCH codes of moderate length (over Galois fields $GF(2^6), \dots, GF(2^{12})$), the syndrome calculation and the Chien search can be performed in parallel structures that calculate all syndrome values and all error positions within a single clock cycle, whereas the calculation of the error location polynomial is often performed using the Berlekamp-Massey algorithm (BMA), which requires several iterations. Alternatively, decoders based on Peterson's algorithm [19] were proposed in [9,10,20,21]. Both algorithms determine the error location polynomial. However, we demonstrate that for BCH codes with small error correcting capabilities Peterson's algorithm can be more efficient in terms of decoding complexity than the BMA. We propose an inversion-less version of Peterson's algorithm and a corresponding decoding architecture for single, double, and triple error correcting BCH codes. This decoder is faster than the fully parallel BMA at a comparable circuit size.
Next, we investigate a hybrid decoding approach that combines algebraic hard-input decoding of binary block codes with soft-input decoding. In particular, an acceptance criterion is utilised which determines the reliability of a candidate codeword. This acceptance criterion was proposed in [8] for decoding concatenated codes with binary inner codes and outer Reed-Solomon (RS) codes. In particular, an error and erasure decoding of the outer RS codes was considered in [8], where the acceptance criterion improves the decoding performance. We demonstrate that this acceptance criterion can also be used as a stopping rule to reduce the decoding complexity. For many received codewords the stopping criterion indicates that the hard-decoding result is sufficiently reliable, and the costly soft-input decoding can be omitted. This hybrid decoding approach can enhance the decoder throughput for GC codes as proposed in [13].
Finally, we consider the hardware costs for the soft-input decoder. The hardware size of the soft-input decoder proposed in [13] is dominated by the memory for the soft-values. In order to reduce the memory requirements, we propose a bit-flipping decoder that stores only the positions and soft values of the least reliable code symbols. The number of stored symbols depends on the minimum Hamming distance of the code. For high rate BCH codes, this method significantly reduces the memory requirements and has little adverse effect on the decoding performance.
The remainder of this article is organised as follows. In Section 2, we propose an inversion-less version of Peterson's algorithm for triple error correcting BCH codes and a pipelined decoder architecture. The bit-flipping decoding is described in Section 3, where we also analyse the stopping rule. In Section 4, we discuss applications of the proposed decoding concepts for concatenated codes and present an approach to reduce the memory requirements for soft values. Performance results and conclusions are provided in Sections 5 and 6, respectively.

| ALGEBRAIC BCH DECODER
In this section, we suggest an inversion-less version of Peterson's algorithm for triple error correcting BCH codes. This algorithm is more efficient than the decoders employing Galois field inversion [10,20]. Moreover, the proposed algorithm provides more flexibility regarding the hardware implementation and enables pipelining to speed up the decoding. A pipelined decoding architecture is presented. Furthermore, we present conditions for decoding failure detection with extended BCH codes. If the actual number of errors exceeds the error correction capability of the BCH code, the decoder may fail to determine a valid codeword. However, such failures can be detected, and this failure detection can be utilised for the decoding of concatenated codes [23].

| Peterson's algorithm
In this section, we briefly review Peterson's algorithm and introduce the notation. The received vector is $r(x) = v(x) + e(x)$, where $v(x)$ is a codeword of length $n$ and $e(x) = e_0 + e_1 x + \dots + e_{n-1} x^{n-1}$ is the error vector. $S_1, S_2, \dots, S_{2t-1}$ denote the syndrome values, which are defined as
$$S_i = r(\alpha^i) = e(\alpha^i),$$
where $\alpha$ is the primitive element of the Galois field $GF(2^m)$.
For binary BCH codes, the following relation holds:
$$S_{2i} = S_i^2.$$
Let $\nu$ be the actual number of errors and $t$ the error correcting capability of the BCH code. The coefficients of the error location polynomial $\sigma(x) = \sigma_0 + \sigma_1 x + \dots + \sigma_\nu x^\nu$ satisfy a set of equations called Newton's identities. In matrix form these equations are
$$A_i \Delta_i = \mathbf{S}_i, \qquad (3)$$
where $\sigma_0 = 1$, the matrix $A_i$ is formed from the syndrome values, $\Delta_i$ denotes the vector of coefficients of the error location polynomial, and $\mathbf{S}_i$ is the syndrome vector.
The material in this paper was presented in part at the International ITG Conference on Systems, Communications and Coding, 2019 [22].
FREUDENBERGER ET AL.
Note that the matrix $A_i$ is singular for $i > \nu$. Hence, Peterson's algorithm first calculates the number of errors $\nu$. Starting with $i = t$, the determinant $D_i = \det(A_i)$ is calculated. If $D_i = 0$, the algorithm reduces the matrix $A_i$ until $D_i = \det(A_i) \neq 0$ holds and Equation (3) can be solved.
Finally, the Chien search determines the error positions by searching for the roots of the error location polynomial. The calculation of $\sigma(\alpha^i)$ for $i = 0, \dots, n - 1$ can be conducted in parallel using simple logic operations [12,16].
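The syndrome definition can be illustrated with a minimal Python sketch. The toy field $GF(2^4)$ with the primitive polynomial $x^4 + x + 1$, the bit ordering of the received vector, and all function names are assumptions made for this illustration only; they are not taken from the decoder architecture discussed in this paper.

```python
# Illustrative sketch only: GF(2^4) with x^4 + x + 1 is an assumed toy field.
M = 4
PRIM_POLY = 0b10011          # x^4 + x + 1
N = (1 << M) - 1             # code length n = 2^m - 1 = 15
ALPHA = 2                    # primitive element alpha (the polynomial "x")


def gf_mul(a, b):
    """Multiply two elements of GF(2^M) by shift-and-add with reduction."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & (1 << M):
            a ^= PRIM_POLY
    return result


def gf_pow(a, e):
    """Compute a**e by repeated multiplication (sufficient for a toy field)."""
    result = 1
    for _ in range(e):
        result = gf_mul(result, a)
    return result


def syndrome(r_bits, i):
    """Evaluate S_i = r(alpha^i), where r_bits[j] is the coefficient of x^j."""
    s = 0
    for j, bit in enumerate(r_bits):
        if bit:
            s ^= gf_pow(ALPHA, (i * j) % N)
    return s


# All-zero codeword with two bit errors at positions 3 and 10.
received = [0] * N
received[3] ^= 1
received[10] ^= 1
S1 = syndrome(received, 1)
S2 = syndrome(received, 2)
```

In this sketch the even-indexed syndrome satisfies `S2 == gf_mul(S1, S1)`, which is why only the odd-indexed syndromes have to be computed for binary BCH codes.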

| Calculating the error location polynomial for single, double, and triple errors
For single, double, and triple errors the following direct solutions of Newton's identities follow [24,25]:
$$\sigma(x) = 1 + S_1 x \quad \text{for } \nu = 1,$$
$$\sigma(x) = 1 + S_1 x + \frac{S_1^3 + S_3}{S_1} x^2 \quad \text{for } \nu = 2, \qquad (8)$$
$$\sigma(x) = 1 + S_1 x + \frac{S_1^2 S_3 + S_5}{S_1^3 + S_3} x^2 + \left( S_1^3 + S_3 + S_1 \frac{S_1^2 S_3 + S_5}{S_1^3 + S_3} \right) x^3 \quad \text{for } \nu = 3. \qquad (9)$$
These solutions are used in [10,20,26] for decoding BCH codes. The main difference between [10] and [20] is the implementation of the Galois field inversion in Equation (9). For instance, in [10], a parallel hardware implementation is proposed. This architecture requires only four Galois field multipliers, but additionally a Galois field inversion is required. The complexity and the throughput of this architecture are determined by the inversion. For the Galois field $GF(2^{10})$ the size of the inversion is about twice the size of a multiplier and the length of the critical path is four times longer than that of a multiplier. In [20], the inversion is implemented using a look-up table, which is only efficient for small Galois fields, because the table size is of order $O(m 2^m)$. Even for moderate Galois field sizes, for example, $m = 8, \dots, 12$, such look-up tables are costly, in particular if multiple instances of the decoder are required. Hence, omitting the inversion reduces hardware complexity and speeds up the calculation.
In the following, we propose an algorithm for triple errors that omits the Galois field inversion, similar to the approach in [26] that considers double errors. First, we consider the case of single and double errors. Note that the roots of the error location polynomial do not change if we multiply all coefficients with a non-zero factor. For instance, multiplying the right hand side of Equation (8) with $S_1 \neq 0$, we obtain for $\nu = 2$
$$\sigma(x) = S_1 + S_1^2 x + D_2 x^2 \qquad (10)$$
with the determinant
$$D_2 = S_1^3 + S_3. \qquad (11)$$
Note that for $\nu = 1$ and $\nu = 2$, $S_1$ is non-zero. For a single error in position $i$ we have $S_1 = \alpha^i \neq 0$. Similarly, for two errors in positions $i$ and $j$, we have $S_1 = \alpha^i + \alpha^j \neq 0$, because $\alpha^i \neq \alpha^j$. Equation (10) is also a solution for $\nu = 1$, because $D_1 = S_1 \neq 0$ and $D_2 = 0$ holds for a single error.
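Next, we consider the case $\nu \geq 2$. For $\nu = 2$ and $\nu = 3$, we have $D_2 \neq 0$ [20]. To see that, consider $\nu = 2$. We have $S_1 = \alpha^i + \alpha^j$ and $S_3 = \alpha^{3i} + \alpha^{3j}$. Hence,
$$D_2 = S_1^3 + S_3 = \alpha^{2i}\alpha^{j} + \alpha^{i}\alpha^{2j} = \alpha^{i+j}(\alpha^i + \alpha^j) \neq 0.$$
Similarly, for $\nu = 3$, we have $S_1 = \alpha^i + \alpha^j + \alpha^k$ and $S_3 = \alpha^{3i} + \alpha^{3j} + \alpha^{3k}$. Consequently,
$$D_2 = \alpha^{2i}\alpha^{j} + \alpha^{2i}\alpha^{k} + \alpha^{2j}\alpha^{i} + \alpha^{2j}\alpha^{k} + \alpha^{2k}\alpha^{i} + \alpha^{2k}\alpha^{j} = (\alpha^i + \alpha^j)(\alpha^i + \alpha^k)(\alpha^j + \alpha^k).$$
The last term is the determinant of the following matrix:
$$\begin{pmatrix} 1 & 1 & 1 \\ \alpha^{i} & \alpha^{j} & \alpha^{k} \\ \alpha^{2i} & \alpha^{2j} & \alpha^{2k} \end{pmatrix}.$$
This matrix has full rank, because the columns are linearly independent. Hence, $D_2 \neq 0$ holds for $\nu = 3$. Now, multiplying the right hand side of Equation (9) by $D_2$, we obtain a solution for $\nu = 3$:
$$\sigma(x) = D_2 + S_1 D_2 x + \delta_2 x^2 + D_3 x^3$$
with
$$\delta_2 = S_1^2 S_3 + S_5 \qquad (16)$$
and the determinant
$$D_3 = S_1^6 + S_1^3 S_3 + S_1 S_5 + S_3^2.$$
Using Equations (1), (11) and (16), we obtain
$$D_3 = D_2^2 + S_1 \delta_2. \qquad (18)$$
The decoding procedure is summarised in Algorithm 1. This algorithm can easily be adapted to the decoding of single and double error correcting codes, for example by setting $D_3 = 0$ for double error correcting BCH codes. This is important for the decoding of GC codes that use nested inner codes [12,13], where the error correcting capability increases from level to level.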
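The inversion-less calculation of the error location polynomial can be sketched in Python as follows. This is not a transcription of Algorithm 1; it is a minimal sketch under stated assumptions: the toy field $GF(2^4)$ with primitive polynomial $x^4 + x + 1$, the expressions $D_2 = S_1^3 + S_3$, $\delta_2 = S_1^2 S_3 + S_5$ and $D_3 = D_2^2 + S_1 \delta_2$ obtained by scaling the direct solutions of Newton's identities, and hypothetical function names. The Chien search is written as a plain loop instead of the parallel logic used in hardware.

```python
# Illustrative sketch only: GF(2^4) with x^4 + x + 1 is an assumed toy field.
M, PRIM_POLY = 4, 0b10011
N = (1 << M) - 1      # code length n = 15
ALPHA = 2             # primitive element alpha


def gf_mul(a, b):
    """Multiply two elements of GF(2^M) by shift-and-add with reduction."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & (1 << M):
            a ^= PRIM_POLY
    return result


def gf_pow(a, e):
    """Compute a**e by repeated multiplication."""
    result = 1
    for _ in range(e):
        result = gf_mul(result, a)
    return result


def syndromes(r_bits):
    """Return (S1, S3, S5) with S_i = r(alpha^i)."""
    out = []
    for i in (1, 3, 5):
        s = 0
        for j, bit in enumerate(r_bits):
            if bit:
                s ^= gf_pow(ALPHA, (i * j) % N)
        out.append(s)
    return tuple(out)


def error_locator(S1, S3, S5):
    """Scaled error location polynomial [c0, c1, c2, c3]; no inversion used."""
    S1_sq = gf_mul(S1, S1)
    D2 = gf_mul(S1_sq, S1) ^ S3                 # D2 = S1^3 + S3
    delta2 = gf_mul(S1_sq, S3) ^ S5             # delta2 = S1^2 * S3 + S5
    D3 = gf_mul(D2, D2) ^ gf_mul(S1, delta2)    # D3 = D2^2 + S1 * delta2
    if D3 != 0:                                 # three errors assumed
        return [D2, gf_mul(S1, D2), delta2, D3]
    if S1 != 0:                                 # one or two errors
        return [S1, S1_sq, D2, 0]
    return [1, 0, 0, 0]                         # no error detected


def chien_search(sigma):
    """Return all positions p for which sigma(alpha^(-p)) = 0."""
    positions = []
    for p in range(N):
        x = gf_pow(ALPHA, (N - p) % N)          # alpha^(-p)
        acc, x_pow = 0, 1
        for coeff in sigma:
            acc ^= gf_mul(coeff, x_pow)
            x_pow = gf_mul(x_pow, x)
        if acc == 0:
            positions.append(p)
    return positions


def decode_error_positions(r_bits):
    """Estimate the error positions of a received binary word."""
    return chien_search(error_locator(*syndromes(r_bits)))
```

Applied to the all-zero codeword with bit errors at positions 1, 6 and 12, `decode_error_positions` returns these three positions. Only multiplications, squarings and additions are needed, which is the property that the hardware discussion relies on.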
In [13,23], concatenated codes based on extended BCH codes were presented, where the additional parity bit for the BCH codes can be used for failure detection after the algebraic BCH decoding. This failure detection can improve the decoding performance of the outer RS codes with error and erasure decoding.
The proposed decoding approach can be applied to extended BCH codes. Table 1 summarises the conditions for a valid syndrome $S_0$ of the parity check for extended BCH codes, where X denotes unused conditions. For instance, if the determinant $D_3$ is non-zero, the algebraic decoder assumes that three errors have occurred. An error pattern with an odd number of errors implies $S_0 = 1$. Hence, $D_3 \neq 0$ and $S_0 = 0$ correspond to an uncorrectable error pattern and a decoding failure can be detected. Similarly, the condition $D_2 \neq 0$ and $D_3 = 0$ indicates two errors, which implies $S_0 = 0$. Consequently, a failure is detected for $D_2 \neq 0$, $D_3 = 0$, and $S_0 = 1$.
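The parity-based failure detection can be sketched as a small decision function. The two cases for three and two errors follow the conditions stated above; the handling of the single-error and the error-free case is an assumption derived from the same parity argument, and the function name is hypothetical.

```python
def parity_failure_detected(S1, D2, D3, S0):
    """Return True if the parity syndrome S0 contradicts the assumed error count.

    S1, D2, D3 are Galois field elements given as integers (0 means zero),
    S0 is the overall parity syndrome (0 or 1) of the extended BCH code.
    """
    if D3 != 0:
        assumed_errors = 3      # stated in the text: three errors, S0 must be 1
    elif D2 != 0:
        assumed_errors = 2      # stated in the text: two errors, S0 must be 0
    elif S1 != 0:
        assumed_errors = 1      # assumption: odd number of errors, S0 must be 1
    else:
        assumed_errors = 0      # assumption: no error, S0 must be 0
    expected_parity = assumed_errors % 2
    return S0 != expected_parity
```

For example, `parity_failure_detected(S1=3, D2=5, D3=7, S0=0)` reports a failure, because a non-zero $D_3$ indicates three errors while the parity check indicates an even number of errors.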

| Hardware architecture
In this section, we present a hardware architecture for the proposed algebraic decoding algorithm and compare its speed (critical path length) with the BMA. Note that the critical path length and the circuit size are dominated by the Galois field multipliers and the Galois field inversion. For $GF(2^m)$, the size of a bit-parallel multiplier grows with order $O(m^2)$ and the critical path with $O(m)$. The Galois field inversion is often implemented using Fermat's little theorem, which requires only a single multiplier and a squaring operation, but $m - 1$ clock cycles [27]. Hence, the total number of basic logic operations per inversion is of the order $O(m^3)$. On the other hand, the addition and squaring operations over $GF(2^m)$ are of the order $O(m)$ with a critical path length $O(1)$. Consequently, these two operations are neglected in the following discussion.
Algorithm 1 can be implemented performing all operations in parallel. Such a parallel implementation requires four multipliers and has a critical path length of two multipliers. It is more efficient than the implementation proposed in [10], which uses three multipliers and one inversion. The logic for the inversion is about twice the size of a multiplier (for $GF(2^{10})$) and the critical path length of the inversion is equivalent to four multiplications. The total critical path in [10] has a length that is equivalent to six multiplications. Hence, at a smaller size the proposed algorithm has a significantly shorter critical path.
The Galois field inversion is an atomic operation which limits the efficiency of pipelining, whereas the proposed algorithm enables pipelined architectures that can speed up the decoding. Figure 1 presents a decoder pipeline (without control logic). In the first stage the variables $D_2$ and $\delta_2$ are calculated according to Equations (11) and (16), respectively. The syndrome $S_1$ is stored, as it is required for the calculation of $D_3$ in the second stage. In the second pipeline stage $D_3$ is determined according to Equation (18). The final error location polynomial is provided as $\sigma(x) = D_2 + S_1 D_2 x + \delta_2 x^2 + D_3 x^3$. The pipeline requires four multipliers and additional registers (three registers of width $m$ bits) to store intermediate results. The pipeline reduces the critical path length to a single multiplication. Hence, the pipelined architecture doubles the throughput compared with the structure without pipeline. Note that the fully parallel BMA also has a critical path length of a single multiplication. However, the parallel BMA requires $2t$ multipliers, at least $2t$ registers and $t$ iterations [28,29]. Hence, the proposed architecture is smaller and about three times as fast as the parallel BMA.
TABLE 1 Conditions for the parity check for extended BCH codes
To verify the above size considerations, the proposed decoding algorithm has been implemented on a field-programmable gate array (FPGA) in Verilog. Table 2 contains results for the Xilinx Virtex-7 FPGA, where throughput is presented in terms of codewords per second. The fundamental building blocks of an FPGA are flip-flops and look-up tables (LUT). The size of the logic is represented by the number of LUTs. Table 2 presents data for $m = 8$ and $m = 12$. Both decoders require $3m$ flip-flops for the registers. The size of the decoder for $m = 12$ is dominated by the four multipliers, which require about 90% of the logic. The speed of the decoders is determined by the achievable clock frequency $f_{clk}$. The circuit for $m = 8$ achieves a clock frequency $f_{clk} = 500$ MHz, that is, a throughput of $500 \cdot 10^6$ BCH codewords per second, because one codeword is processed per clock cycle. The latency is two clock cycles, that is, 4 ns. Moreover, the decoder for $m = 8$ is about 1.5 times faster than the decoder for $m = 12$, which confirms that the critical path length is of order $O(m)$. Similarly, the ratio of 2.2 for the logic size for $m = 12$ and $m = 8$ agrees well with the estimate $O(m^2)$ for the circuit size. The results for the parallel BMA are estimates based on the required number of multipliers and registers. The actual size will be higher. The BMA requires three clock cycles per BCH codeword. Hence, the achievable throughput is $111 \cdot 10^6$ BCH codewords per second at a clock frequency of $f_{clk} = 333$ MHz.
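The throughput and latency figures quoted above follow from simple arithmetic, which the following sketch reproduces. The function names are hypothetical; the clock frequencies and cycle counts are the values stated in the text.

```python
def throughput(f_clk_hz, cycles_per_codeword):
    """Codewords per second for a decoder that needs the given cycles per codeword."""
    return f_clk_hz / cycles_per_codeword


def latency_seconds(f_clk_hz, pipeline_stages):
    """Latency of a pipeline with the given number of stages."""
    return pipeline_stages / f_clk_hz


proposed_m8 = throughput(500e6, 1)        # pipelined decoder, m = 8
parallel_bma = throughput(333e6, 3)       # BMA with three clock cycles per codeword
latency_m8 = latency_seconds(500e6, 2)    # two pipeline stages

speed_ratio = 500e6 / 333e6               # compare with the O(m) estimate 12 / 8
size_estimate = (12 / 8) ** 2             # O(m^2) estimate for the logic size ratio
```

The O(m) and O(m^2) estimates evaluate to 1.5 and 2.25, respectively, which is close to the reported ratios of about 1.5 for the clock frequency and 2.2 for the logic size.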

| SOFT-INPUT DECODING
In this section, we discuss soft-input decoding of BCH codes based on a bit-flipping procedure. Bit-flipping decoding of binary block codes was pioneered by Chase [30]. Chase introduced reliability information-based decoding procedures that generate a list of candidate codewords by flipping bits in the received word. The test patterns for the bit-flipping are based on the least reliable positions of the received word. For instance, the p-Chase algorithm decodes the $2^p$ binary n-tuples obtained by considering all possible values in the $p$ least reliable positions [31]. The algebraic hard-input decoding is applied for each test pattern. At the end, the most likely codeword is selected from this list of candidates.
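The generation of the $2^p$ test patterns can be sketched as follows. This is a minimal illustration with hypothetical function names; the order in which the patterns are enumerated is an assumption, since the text does not prescribe one.

```python
from itertools import product


def chase_test_patterns(soft_values, p):
    """Return the p least reliable positions and all 2^p bit-flipping patterns.

    soft_values[i] is the reliability z_i = |y_i| of position i.
    Each pattern is a binary list; a 1 marks a position that is flipped.
    """
    n = len(soft_values)
    order = sorted(range(n), key=lambda i: soft_values[i])
    least_reliable = order[:p]
    patterns = []
    for bits in product((0, 1), repeat=p):
        pattern = [0] * n
        for position, bit in zip(least_reliable, bits):
            pattern[position] = bit
        patterns.append(pattern)
    return least_reliable, patterns


positions, patterns = chase_test_patterns([0.9, 0.1, 0.5, 0.05, 1.2], p=2)
```

Each pattern is added (XOR) to the hard decision vector before the algebraic decoder is applied, so the all-zero pattern corresponds to plain hard-input decoding.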
In [32,33], acceptance criteria for bit-flipping decoding were proposed that aim at stopping the decoding once the ML codeword is found. Such stopping rules aim at achieving near-ML decoding performance and reducing the complexity compared to Chase decoding. We consider a similar acceptance criterion for each estimated codeword. This criterion aims at exploiting error and erasure decoding of the outer RS codes in a concatenated coding scheme [8]. This scheme uses the p-Chase algorithm and the acceptance criterion to determine the reliability of the final codeword candidate. This acceptance criterion enables a trade-off between the error and the erasure probability. It was shown in [8] that bit-flipping combined with error and erasure decoding can significantly improve the soft-input decoding performance compared with Kaneko's rule [32]. It can even outperform ML decoding of the BCH code in combination with outer RS decoding, because the acceptance criterion can efficiently detect unreliable codewords.
In contrast to [8,13,31], we do not use a fixed number of decoding cycles. We propose a hybrid decoding approach that combines algebraic hard-input decoding of binary BCH codes with bit-flipping decoding. In particular, we use the acceptance criterion to determine the reliability of each candidate codeword. For many received words the stopping criterion indicates that the hard-decoding result is sufficiently reliable, and the costly bit-flipping decoding can be omitted. In cases where bit-flipping decoding is required, we propose a procedure with a maximum number of decoding cycles, and apply a decision rule to each candidate. The decoder can stop after each decoding round if the codeword fulfils the acceptance criterion. If no reliable codeword is found after the final decoding round, an erasure is declared.

| Kaneko's acceptance criterion
Kaneko's rule is a sufficient condition that the ML codeword is found. However, we show that Kaneko's rule can be too restrictive. First, we briefly introduce the notation and review Kaneko's acceptance criterion. Afterwards, we discuss the new acceptance criterion.

Consider a binary linear code $\mathcal{B}(n, k, d)$, where $n$ is the codeword length, $k$ is the dimension, and $d$ denotes the minimum Hamming distance. We denote the Hamming distance between two binary vectors $a$ and $b$ by $d_H(a, b)$. We assume transmission of codewords $x = (x_1, \dots, x_n) \in \mathcal{B}$ over an additive white Gaussian noise (AWGN) channel with binary phase shift keying. Let $y$ be the received vector. The hard decision vector $\hat{y} = (\hat{y}_1, \dots, \hat{y}_n)$ is obtained from the signs of the received values. Furthermore, we use the vector of soft-values $z = (z_1, \dots, z_n)$ with $z_i = |y_i|$. The soft-values $z$ are sorted in ascending order, $\tilde{z} = (\tilde{z}_1, \dots, \tilde{z}_n)$ with $\tilde{z}_1 \leq \tilde{z}_2 \leq \dots \leq \tilde{z}_n$. The proposed criterion is similar to Kaneko's acceptance criterion for bit-flipping decoding [32] and uses the same metric. The bit-flipping metric between a codeword $x$ and the received vector $y$ is defined as
$$l(x, y) = \sum_{i:\, x_i \neq \hat{y}_i} z_i. \qquad (20)$$
Let $m = d_H(x', \hat{y})$ be the Hamming distance between the codeword $x'$ and $\hat{y}$. Furthermore, let $x_0$ denote the codeword found from a hard-input decoder with Hamming distance $m_0 = d_H(x_0, \hat{y})$. Kaneko's acceptance criterion (21) is a sufficient condition for $x'$ being the ML codeword [32].
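The bit-flipping metric can be illustrated with a short Python sketch. It assumes that the metric is the sum of the soft-values $z_i = |y_i|$ over all positions in which the codeword differs from the hard decision vector, and that the bit 0 is mapped to a positive channel symbol; both the mapping and the function names are assumptions of this illustration.

```python
def hard_decision(y):
    """Hard decision vector; assumes bit 0 is transmitted as a positive value."""
    return [0 if value >= 0 else 1 for value in y]


def bit_flipping_metric(x, y):
    """Sum of the soft-values |y_i| over all positions where x differs from the hard decision."""
    y_hat = hard_decision(y)
    return sum(abs(value) for x_i, h_i, value in zip(x, y_hat, y) if x_i != h_i)


y = [0.8, -0.2, 1.1, -0.9]
metric_hard = bit_flipping_metric(hard_decision(y), y)   # no altered position
metric_one_flip = bit_flipping_metric([0, 0, 0, 1], y)   # only position 1 altered
```

The metric of the hard decision vector itself is zero, and a candidate that differs only in an unreliable position obtains a small metric, which is why candidates with small metric are considered more likely.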

| Acceptance criterion
In order to motivate the stopping rule and the hybrid decoding approach, we analyse Kaneko's acceptance criterion and some properties of the bit-flipping metric. Let $p(y|x)$ be the conditional density function for receiving $y$ after transmitting the codeword $x$ over the AWGN channel.
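Lemma 1 For the AWGN channel, Equation (22) holds, where $x'$ and $x''$ are two codewords.
Proof From the definition in Equation (20) follows:
$$l(x', y) - l(x'', y) = \sum_{i:\, x'_i \neq \hat{y}_i} z_i - \sum_{i:\, x''_i \neq \hat{y}_i} z_i.$$
Cancelling all terms in both sums with $x'_i = x''_i$ results in
$$l(x', y) - l(x'', y) = \sum_{i:\, x'_i \neq x''_i,\ x'_i \neq \hat{y}_i} z_i - \sum_{i:\, x'_i \neq x''_i,\ x''_i \neq \hat{y}_i} z_i. \qquad (24)$$
The AWGN channel is memoryless. Hence, the probability of receiving the vector $y$ given the codeword $x$ is
$$p(y|x) = \prod_{i=1}^{n} p(y_i|x_i),$$
where $p(y_i|x_i)$ is the conditional density function for receiving $y_i$ given the symbol $x_i$. Using this factorization, we obtain
$$\ln \frac{p(y|x')}{p(y|x'')} = \sum_{i=1}^{n} \ln \frac{p(y_i|x'_i)}{p(y_i|x''_i)} = \sum_{i:\, x'_i \neq x''_i} \ln \frac{p(y_i|x'_i)}{p(y_i|x''_i)}, \qquad (26)$$
where (26) follows from cancelling all terms with $x'_i = x''_i$. Evaluating the remaining terms for the Gaussian density yields Equation (27). Using Equations (24) and (27), we obtain Equation (22).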
Using Lemma 1, we can derive some interesting properties of the bit-flipping metric. In the following proposition, $x'$ denotes the ML codeword and $x''$ the second best codeword. (b) follows from the fact that the bit-flipping metric is nonnegative and $l(\hat{y}, y) = 0$.
To prove (c), we first consider the case where $\hat{y}$ is a codeword, that is, $m = 0$ and $l(x', y) = 0$. In this case the Hamming distance $d_H(x'', \hat{y}) = d_H(x'', x')$ is at least $d$. Consequently, the metric $l(x'', y)$ is the sum over at least $d$ values $z_i$. This sum is lower bounded by $l(x'', y) \geq \sum_{i=1}^{d} \tilde{z}_i$ due to the ordering of the soft-values. Now let $x'$ be a codeword with Hamming distance $m$ to the vector $\hat{y}$. In this case, the codeword $x''$ differs in at least $d - m$ positions from $\hat{y}$. Hence, the lower bound in Equation (30) follows. Similarly, Equation (29) follows from the ordering of the soft-values.
From inequality (30), it is seen that $l(x', y) < \sum_{i=1}^{d-m} \tilde{z}_i$ is a sufficient condition for $x'$ being the ML codeword. In fact, this is a special case of Kaneko's acceptance criterion according to (21). Kaneko's acceptance criterion is a sufficient condition for $x'$ being the ML codeword. However, Equation (21) is not a necessary condition. In some cases Kaneko's rule is too restrictive, which is shown in the next proposition.
Proof The lower bound in Equation (29) depends on the Hamming distance $m$ of the ML codeword. From this lower bound it is observed that Kaneko's acceptance criterion can only be fulfilled, if holds. To obtain an upper bound on $m$, we assume that $m$ takes on the maximum value: Now we bound the term $d - \lfloor (m + m_0)/2 \rfloor$. Due to From Proposition 1 follows that $m_0 \geq 1$ if $\hat{y}$ is not a codeword. Hence, we have From which we obtain the upper bound in Equation (32).
It was shown by Chase in [30] that bit-flipping decoding can achieve near-ML performance and can correct up to $d - 1$ errors. Consequently, the metric of the ML codeword can be the sum over $m = d - 1$ reliability values, which is much larger than that indicated in Equation (32).
In order to improve the detection rate, we use the acceptance criterion (37) proposed in [8], where $M$ and $T$ are parameters that enable a trade-off between the decoding performance and the number of test patterns. The acceptance criterion is based on Yamamoto and Itoh's decision rule [34], which is a simplification of Forney's optimal criterion on the trade-off between erasure and error probability [35]. It was shown in [8] that this criterion is a good estimation of the acceptance criterion proposed by Yamamoto and Itoh. However, to obtain the best possible trade-off, these parameters must be determined by Monte Carlo simulation.
We utilise the acceptance criterion (37) for a hybrid decoding approach. In the first decoding step, the syndrome is checked. According to Proposition 1, we stop the decoding if the received vector is a valid codeword. In the second stage, algebraic hard-input decoding is applied based on the hard decision vector $\hat{y}$. This step may result in an estimated codeword or a decoding failure as mentioned in Section 2. For instance, the error location polynomial may indicate an odd number of errors while the parity bit indicates an even number. Hence, the solution of the error location polynomial is inconsistent. In case of a decoding failure, we proceed with bit-flipping decoding. Inconsistent solutions are not considered as candidate codewords. After all algebraic decoding steps the reliability of the estimated codeword is tested according to Equation (37). The decoding is stopped if the criterion is fulfilled. Otherwise, the decoding process enters the next decoding step. In contrast to [8], we test the estimated codeword after each decoding iteration.
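The control flow of the hybrid decoding approach can be sketched as follows. The sketch is deliberately generic: the syndrome check, the algebraic decoder, and the acceptance test are passed in as functions, because their details (in particular the acceptance criterion with the parameters $M$ and $T$) are not reproduced here. All names and the returned stage labels are hypothetical.

```python
def hybrid_decode(y_hat, is_codeword, algebraic_decode, test_patterns, accept):
    """Return (codeword, stage); codeword is None if an erasure is declared.

    y_hat            hard decision vector (list of bits)
    is_codeword      function: word -> bool (syndrome check)
    algebraic_decode function: word -> codeword, or None on decoding failure
    test_patterns    iterable of bit-flipping patterns (lists of bits)
    accept           function: codeword -> bool (acceptance criterion)
    """
    # Step 1: a valid codeword is accepted without further decoding.
    if is_codeword(y_hat):
        return list(y_hat), "syndrome"

    # Step 2: algebraic hard-input decoding of the hard decision vector.
    candidate = algebraic_decode(y_hat)
    if candidate is not None and accept(candidate):
        return candidate, "hard"

    # Step 3: bit-flipping decoding, testing the candidate after each round.
    for pattern in test_patterns:
        trial = [bit ^ flip for bit, flip in zip(y_hat, pattern)]
        candidate = algebraic_decode(trial)
        if candidate is not None and accept(candidate):
            return candidate, "soft"

    # No reliable codeword was found: declare an erasure.
    return None, "erasure"
```

The early returns are the point of the design: whenever the acceptance test succeeds, the remaining test patterns are never processed, which is how the stopping rule reduces the average decoding complexity.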
In order to demonstrate that the proposed decoding methods achieve a significant reduction of the decoder complexity, we consider the decoding of GC codes with BCH codes as component codes. First, we review the GC construction and its parameters. A detailed discussion can be found in [5,13,36,37]. Afterwards, we design a GC code for comparison with a polar code construction proposed in [38] for flash memories. In the next section, we present simulation results for this code with different decoding strategies of the inner BCH codes.

| GC encoding
The GC codeword is organised in an $n_b \times n_a$ matrix as depicted in Figure 2. All component codes are constructed over $GF(2^m)$. The parameters $n_a$ and $n_b$ denote the lengths of the outer codes $\mathcal{A}^{(l)}$ and inner codes $\mathcal{B}^{(l)}$, respectively. The encoding starts with the outer codes. The rows of the codeword matrix are protected by $L$ Reed-Solomon (RS) codes of length $n_a$, where $L$ denotes the number of levels. $m$ elements of each column represent one symbol from the Galois field $GF(2^m)$. Hence, $m$ rows of the matrix belong to a codeword of an outer code $\mathcal{A}^{(l)}$, $l = 0, \dots, L - 1$. Note that the code rate of the outer codes increases from level to level. The outer codes protect $Lm$ rows of the matrix. The dimension of the outer code $\mathcal{A}^{(l)}$ is denoted by $k_{a,l}$. The remaining $n_b - Lm$ rows are used for the redundancy of the inner codes or for information bits without outer encoding. Let $k_u$ denote the number of rows without outer encoding. Then, the dimension of the GC code is
$$k = m \sum_{l=0}^{L-1} k_{a,l} + k_u n_a.$$
After the outer encoding, the columns of the codeword matrix are encoded with binary BCH codes $\mathcal{B}^{(l)}$ of length $n_b$. In the simulations, we consider only extended BCH codes as inner codes. Each column of the codeword matrix is the sum of $L$ codewords of nested extended BCH codes:
$$\mathcal{B}^{(L-1)} \subset \mathcal{B}^{(L-2)} \subset \dots \subset \mathcal{B}^{(0)}.$$
Hence, a higher level code is a sub-code of its predecessor, where the higher levels have higher error correcting capabilities, that is, $t_{b,L-1} \geq t_{b,L-2} \geq \dots \geq t_{b,0}$, where $t_{b,l}$ is the error correcting capability of level $l$. The codeword of the $j$th column is the sum of $L$ codewords:
$$b_j = \sum_{l=0}^{L-1} b_j^{(l)}.$$
These codewords $b_j^{(l)}$ are formed by encoding the symbols $a_{j,l}$ with the corresponding sub-code $\mathcal{B}^{(l)}$, where $a_{j,l}$ is the $j$th symbol ($m$ bits) of the outer code $\mathcal{A}^{(l)}$. For this encoding, $(L - l - 1)m$ zero bits are prefixed onto the symbol $a_{j,l}$. Note that every column $b_j$ of the matrix is a codeword of $\mathcal{B}^{(0)}$, because of the linearity of the nested codes.

Example 1
In this example, we construct a code for a comparison with a polar code [38] for applications in flash memories. The comparison will be discussed in Section 5. The polar code is suitable for a block length of 1 kilobyte. Similarly, a GC code with code rate $R = 0.875$ and a code length of approximately 8192 is constructed. This code is designed according to the rules proposed in [8] to achieve a word error probability of $10^{-5}$ at a channel bit error rate of 0.008. All component codes are constructed over $GF(2^7)$.
The outer RS codes have a length of $n_a = 89$. The inner codes are binary extended BCH codes of length $n_b = 92$. For this GC code, we use $L = 12$ levels, where only the first six levels are protected by outer RS codes. Table 3 summarises the parameters of the GC code, where we use the same RS code from level 6 to level 11. To reduce the complexity, only the first three decoding levels use soft-decoding. The fourth level is protected by the code of the third level. The last six levels do not require error correction of the outer decoder, because the inner decoder guarantees the required word error probability.
In Table 3, $k_{b,l}$ and $d_{b,l}$ are the dimension and minimum Hamming distance of the binary inner codes at level $l$. $k_{a,l}$ and $d_{a,l}$ are the dimension and minimum Hamming distance of the outer RS codes, respectively. Overall, the code has dimension

| GC decoding
Figure 3 illustrates the GC decoding steps, where the decoder processes level by level starting with $l = 0$. Let $l$ be the index of the current level. First, the columns are decoded with respect to the BCH code $\mathcal{B}^{(l)}$. After inner decoding, the information bits of the inner codes have to be inferred (re-image) in order to retrieve the code symbols $a_{j,l}$ of the outer RS code $\mathcal{A}^{(l)}$, where $j$ is the column index. If all symbols of the code $\mathcal{A}^{(l)}$ are inferred, the RS code can be decoded. Error and erasure decoding is used in all levels of the RS code. After RS decoding, a partial decoding result $\hat{a}_l$ is available. This result has to be re-encoded using $\mathcal{B}^{(l)}$. The estimated codewords of the inner code $\mathcal{B}^{(l)}$ are subtracted from the codeword matrix before the next level can be decoded. The detailed encoding and decoding process is described in [13].
In this work, we consider two modifications of the decoder proposed in [13] to reduce the decoding complexity. We apply the acceptance criterion according to Section 3 as a stopping criterion after each decoding iteration. Furthermore, we reduce the memory required for storing LLR values.
The modifications are illustrated in the decoder architectures in Figure 4. The first structure corresponds to the decoder presented in [13]. The second structure depicts the new decoder architecture. The first stage in both decoders is a mapping function that maps the quantized input value to LLR values depending on the estimated channel error probability. For the soft-decoding, these LLR values have to be sorted to determine the least reliable positions. We utilise this sorting to reduce the number of soft values that have to be stored.
In [13], the LLR values for all code symbols and index values that indicate the positions of the least reliable symbols are stored. These positions are needed to determine the bit-flipping patterns for the soft-input decoder, as indicated by the block pattern generation in Figure 4. The corresponding LLR values are required for the metric calculation, which determines the sum over the magnitudes of the LLR values corresponding to all altered positions according to (20). The positions altered by the bit-flipping patterns can be determined by sorting the soft values, but the positions altered by the algebraic decoder are only known after each algebraic decoding step. Hence, all LLR values are stored in [13]. In order to reduce the RAM requirements, we store only the positions and the LLR values corresponding to the least reliable values, where we increase the search depth for the LLR values. This increases the size of the logic for the sorting, but enables a compression of the soft values. This concept utilises the fact that symbols with low reliability are more likely to be altered by the algebraic decoder. Moreover, quantized input and LLR values reduce the possible variations in the LLR values. For the considered high-rate BCH codes, most input values are mapped to LLR values with maximal magnitude. Only a small fraction of input symbols will have a small magnitude. In [13], the search depth is determined by the maximum number of flipped bit positions p for Chase decoding. We store only the LLR values and their positions for the μ > p least reliable symbols. If the algebraic decoder alters symbols within the stored positions, we can calculate the correct metric value. For positions that are not stored, we set the LLR value to the maximum possible magnitude. With a sufficiently large value of μ, this modification typically results in the correct metric according to (20). However, occasionally the metric calculation and the codeword selection will be affected. This leads to a small performance loss, as we will see in Section 5.
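The storage scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `compress_llrs` and `flip_metric` and the toy magnitudes are hypothetical, and the metric is taken to be the plain sum of LLR magnitudes over all altered positions, with the maximum magnitude substituted for positions that were not stored.

```python
from typing import Dict, Iterable, List


def compress_llrs(llr_magnitudes: List[int], mu: int) -> Dict[int, int]:
    """Keep only the mu least reliable positions and their LLR magnitudes."""
    order = sorted(range(len(llr_magnitudes)), key=lambda i: llr_magnitudes[i])
    return {i: llr_magnitudes[i] for i in order[:mu]}


def flip_metric(flipped: Iterable[int], stored: Dict[int, int], max_mag: int) -> int:
    """Sum of LLR magnitudes over all altered positions.

    Positions that were not stored are charged the maximum magnitude.
    """
    return sum(stored.get(i, max_mag) for i in flipped)


# Toy data (hypothetical): 4-bit magnitudes, most symbols saturate at 15.
mags = [15, 15, 3, 15, 1, 15, 15, 7, 15, 15, 12, 15]
stored = compress_llrs(mags, mu=3)  # keeps positions 4, 2 and 7

# All altered positions are stored: the reduced metric is exact.
exact_a = sum(mags[i] for i in (2, 4, 7))
approx_a = flip_metric((2, 4, 7), stored, max_mag=15)

# Position 10 is not stored: the reduced metric overestimates (15 instead of 12).
exact_b = sum(mags[i] for i in (2, 4, 10))
approx_b = flip_metric((2, 4, 10), stored, max_mag=15)

print(exact_a, approx_a, exact_b, approx_b)
```

The second case shows the occasional deviation mentioned in the text: when the algebraic decoder alters a position outside the stored set whose true magnitude is below the maximum, the metric is slightly too large, which can change the codeword selection.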

Example 2
To demonstrate the size reduction for the memory, we consider the BCH codes of length $n_b = 92$ as presented in Example 1. For these codes, only the μ = 10 smallest LLR values are needed to achieve good simulation results, where we consider quantized soft information with six sensing levels as in [38]. With the proposed approach, we store m bits for each position and $(n_{\mathrm{LLR}} - 1)$ bits for the corresponding magnitude. This results in a total of $n_a \cdot \mu \cdot (m + n_{\mathrm{LLR}} - 1) = 89 \cdot 10 \cdot (7 + 4) = 9790$ bits. In this example, the memory for soft values can be reduced by 72%.
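The arithmetic behind the 72% figure can be checked with a short script. The values m = 7 and p = 3 are assumptions: they are not stated in this example, but they are consistent with the 1869 position bits and the 32 752 magnitude bits quoted for the reference decoder that stores all soft values.

```python
n_total = 8188  # length of the GC code
n_b = 92        # length of the inner BCH codes
n_llr = 5       # bits per LLR value (1 hard-decision bit + 4 magnitude bits)
m = 7           # assumed bits per position index (92 < 2**7)
p = 3           # assumed number of flipped positions in the reference decoder
mu = 10         # number of stored least reliable symbols per inner codeword

n_a = n_total // n_b                          # number of inner codewords
full = n_a * n_b * (n_llr - 1) + n_a * m * p  # all magnitudes plus p positions
reduced = n_a * mu * (m + n_llr - 1)          # mu positions with magnitudes
saving = 1 - reduced / full

print(n_a, full, reduced, f"{saving:.1%}")
```

Under these assumptions the reduced scheme needs 9790 bits instead of 34 621 bits, a saving of about 72%.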

| SIMULATION RESULTS
In this section, we present simulation results for the GC code over an AWGN channel with six sensing levels as in [38]. First, we consider the performance of the BCH codes according to Example 1.
The left-hand side of Figure 5 shows different error rates for the codes of the first level l = 0. For the inner BCH code (92,84,4), the word error rate and failure (erasure) probability are depicted with Kaneko's stopping rule (solid lines) and the proposed acceptance criterion (dashed lines). Moreover, word error rates for the outer RS code with error and erasure decoding are presented. In this case, Kaneko's rule results in similar error and erasure probabilities. However, the RS code can correct more erasures than errors. Hence, reducing the error rate at the price of a higher failure rate can improve the performance of RS decoding. This can be observed from the word error rate of the RS code. The new stopping rule reduces the RS word error rate by one order of magnitude at high channel error rates. Note that the new stopping rule hardly affects the complexity of the decoding. This is demonstrated in the right-hand side of Figure 5, which depicts the percentage of decoding rounds that are stopped after the indicated iteration. Similar results are presented for the codes of the third level (l = 2) in Figure 6. In this case, the performance of the RS code is improved by two orders of magnitude and the decoding complexity is slightly reduced compared with Kaneko's stopping rule. Due to the good error correction capability of this level, 95% to 99% of all received words in the third level can be decoded with hard-input decoding. As the error correction capability increases from level to level, the probability of soft decoding decreases from level to level. Hence, we omit soft-input decoding for the higher levels l > 2.
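The benefit of trading errors for erasures follows from the standard bounded-distance condition for RS codes: a pattern with e errors and f erasures can be corrected if 2e + f ≤ n − k. The following sketch illustrates this; the function name and the code parameters (89, 77) are hypothetical and are not taken from the code design considered here.

```python
def rs_can_decode(n: int, k: int, errors: int, erasures: int) -> bool:
    """Bounded-distance error-and-erasure decoding succeeds if 2e + f <= n - k."""
    return 2 * errors + erasures <= n - k


# Hypothetical outer code with n - k = 12 redundancy symbols.
n, k = 89, 77
print(rs_can_decode(n, k, errors=6, erasures=0))   # six errors are correctable
print(rs_can_decode(n, k, errors=7, erasures=0))   # seven errors are not
print(rs_can_decode(n, k, errors=0, erasures=12))  # twelve erasures are correctable
print(rs_can_decode(n, k, errors=2, erasures=8))   # mixed pattern within the bound
```

Each erasure costs one redundancy symbol, whereas each undetected error costs two. An inner decoder that declares a failure instead of delivering a wrong codeword therefore leaves more correction capability to the outer code.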
Figure 7 shows the performance of the overall GC code with the different decoding methods. In particular, we use two sets of trade-off parameters for the proposed criterion. These parameters are summarised in Table 4. One set is used for the decoding without premature stopping. In this case, all $2^p$ iterations of the bit-flipping decoder are used and the criterion according to (37) is used as an acceptance criterion for the final candidate codeword, as proposed in [8]. The second set is used for the decoding with premature stopping. All parameters were obtained by simulation, optimising the performance of the outer RS code with error and erasure decoding.
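The difference between decoding with and without premature stopping can be illustrated with a small bit-flipping decoder. The sketch below makes several assumptions that are not part of the scheme described here: the (7,4) Hamming code stands in for the inner BCH code, positive LLR values denote bit 0, and `accept` is a placeholder threshold on the metric rather than the criterion according to (37).

```python
from itertools import product
from typing import Callable, List, Tuple

# Parity-check columns of the (7,4) Hamming code, a single-error-correcting
# BCH code: column j is the binary representation of j + 1.
H_COLS = [1, 2, 3, 4, 5, 6, 7]


def hard_decode(word: List[int]) -> List[int]:
    """Syndrome decoding that corrects at most one bit error."""
    syndrome = 0
    for bit, col in zip(word, H_COLS):
        if bit:
            syndrome ^= col
    out = list(word)
    if syndrome:
        out[syndrome - 1] ^= 1  # the syndrome equals the 1-based error position
    return out


def bit_flip_decode(
    llr: List[int], p: int, accept: Callable[[float], bool]
) -> Tuple[List[int], float, int]:
    """Try up to 2**p flip patterns on the p least reliable positions."""
    hard = [1 if v < 0 else 0 for v in llr]
    mags = [abs(v) for v in llr]
    weak = sorted(range(len(llr)), key=lambda i: mags[i])[:p]
    best: List[int] = hard
    best_metric = float("inf")
    steps = 0
    for pattern in product((0, 1), repeat=p):  # first pattern: no flips
        steps += 1
        trial = list(hard)
        for pos, flip in zip(weak, pattern):
            trial[pos] ^= flip
        cand = hard_decode(trial)
        # Metric: sum of LLR magnitudes over all positions that differ
        # from the hard decision.
        metric = sum(mags[i] for i in range(len(llr)) if cand[i] != hard[i])
        if metric < best_metric:
            best, best_metric = cand, metric
        if accept(best_metric):  # premature stopping
            break
    return best, best_metric, steps


# All-zero codeword sent; positions 2 and 4 are received in error.
llr = [9, 8, -1, 7, -2, 9, 8]
with_stop = bit_flip_decode(llr, p=2, accept=lambda metric: metric <= 3)
no_stop = bit_flip_decode(llr, p=2, accept=lambda metric: False)
print(with_stop, no_stop)
```

In this toy case both variants return the all-zero codeword with the same metric, but the variant with stopping needs two decoding steps instead of four. When the hard decision already lies within the correction radius of a codeword with a small metric, the loop stops after the first step and no soft-input iterations are needed.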
All simulations were performed for two storage schemes for the LLR values. For the curves denoted by μ = 92, all LLR values were stored, whereas μ = 10 indicates results with reduced memory size. As anticipated, the setup without premature stopping and without memory restriction has the best performance. However, differences occur only at higher channel bit error rates. For the target error rate of the design at 0.008, there is only a very small performance loss with the proposed decoding methods. For comparison, we present results for a polar code with similar parameters and with soft-input decoding as reported in [38]. The GC code has a significantly better performance.
Finally, we consider the complexity reduction with the stopping criterion compared with the decoding proposed in [13]. The corresponding simulation results are depicted in Figure 8. The figure depicts the average number of decoding steps for the inner BCH code with and without stopping (as well as with and without memory restriction) for the first three GC levels. Note that all levels have the same complexity with the decoding according to [13]. With the proposed stopping rule, the number of decoding iterations is reduced by a factor of 3 to 4 for the first level; for the higher levels, the complexity is reduced by a factor of 4 to 6, depending on the channel error rate.
The memory reduction hardly affects the number of iterations. The proposed soft-input decoder achieves nearly the performance of the hard-input decoder reported in [13], that is, the overall soft-input decoding process is twice as fast as the decoding in [13].

| CONCLUSION
In this work, we have proposed different methods to reduce the decoding complexity for binary BCH codes. We have investigated low-complexity hard- and soft-input decoding methods for single, double, and triple error correcting BCH codes. The algebraic decoder and the stopping rule for the bit-flipping decoder can significantly speed up the decoding process. Furthermore, a concept was discussed to reduce the memory requirements for soft values.
We have discussed the proposed decoding methods in the context of generalized concatenated codes, which are constructed from inner BCH codes and outer RS codes. Such codes are used for error correction in flash memories [8,13]. We have demonstrated that the decoding can be accelerated and the memory size reduced with hardly any impact on the overall decoding performance. The memory size could be reduced further if the soft values are compressed additionally by exploiting the different probabilities of occurrence.
The proposed concepts can be useful for other concatenated codes, for example, product codes, half-product codes, and staircase codes. However, these concepts are limited to the decoding of BCH codes with small error correction capability. Peterson's algorithm is only useful for single, double, and triple error correcting BCH codes. For t ≥ 4, we were not able to find a solution where Peterson's algorithm has a lower complexity than the BMA. Moreover, the Chase-2 decoder requires up to $2^{d-1}$ test patterns, that is, the worst-case complexity increases exponentially with the minimum Hamming distance d of the code. Hence, the proposed bit-flipping decoding is also limited to codes with low error correction capability.

Algorithm 1 Inversion-less Peterson algorithm
FIGURE 1 Hardware architecture of the decoder pipeline
FREUDENBERGER ET AL.

Proposition 1 (a) The bit-flipping metric is a maximum likelihood metric. (b) If the hard decision vector $\hat{y}$ is a codeword, then $x' = \hat{y}$ is the ML codeword. (c) Let $m = d_H(x', \hat{y})$ be the Hamming distance between $x'$ and $\hat{y}$. Then the metrics of the best and second best codeword satisfy $l(x', y) \geq \sum \ldots$

Proof (a) follows directly from Lemma 1. The bit-flipping metric can be used for maximum likelihood decoding, because $l(y, x') \leq l(y, x'')$ implies $p(y \mid x') \geq p(y \mid x'')$.

8188, and rate R = k/n = 0.8754.
FIGURE 2 Codeword matrix
TABLE 3 Parameters of a generalized concatenated code of length 8188 and rate R = 0

FIGURE 3 Generalized concatenated decoding schemes
FIGURE 4 Two soft decoder structures

FIGURE 5 Comparison of BCH(92,84,4) soft decoding using (31) (solid line) and the proposed threshold (37) (dashed line) as the stopping criterion
FIGURE 6 Comparison of BCH(92,70,8) soft decoding using Equation (31) (solid line) and the proposed threshold (37) (dashed line) as the stopping criterion
The quantized input values are mapped to LLR values with $n_{\mathrm{LLR}} = 5$ bits each. One bit is the hard decision bit and four bits indicate the magnitude. To store all soft values, $n_{\mathrm{all}} = n_a \cdot n_b \cdot (n_{\mathrm{LLR}} - 1) = 8188 \cdot 4 = 32\,752$ bits have to be stored. Additionally, $n_a \cdot m \cdot p = 1869$ bits are required to store symbol positions.

FIGURE 7 Word error rate for different decoding methods for the GC code (n = 8188 and k = 7168) in comparison with a polar code (n = 8192 and k = 7168) from [38]
FIGURE 8 Average number of decoding steps for the inner BCH code with and without stopping
Parameter values: M = 4, T = 2; M = 3, T = 1; M = 6, T = 2; M = 5, T = 2; M = 8, T = 2; M = 7, T = 1