N50, L50, and related statistics

In computational biology, N50 and L50 are statistics of a set of contig or scaffold lengths. The N50 is similar to a mean or median of lengths, but has greater weight given to the longer contigs. It is used widely in genome assembly, especially in reference to contig lengths within a draft assembly. L50 is the number of contigs whose summed length is N50. There are also the related N90, NG50, and D50 statistics.

Definition

N50

N50 statistic defines assembly quality. Given a set of contigs, each with its own length, the N50 length is defined as the shortest sequence length at 50% of the genome. It can be thought of as the point of half of the mass of the distribution; the number of bases from all contigs longer than the N50 will be close to the number of bases from all contigs shorter than the N50. For example, 9 contigs with the lengths 2,3,4,5,6,7,8,9,and 10, their sum is 54, half of the sum is 27, the size of the genome also happens to be 54. 50% of this assembly would be 10 + 9 + 8 = 27 (half the length of the sequence). Thus the N50=8, which is the size of the contig which, along with the larger contigs, contain half of sequence of a particular genome. Note: When comparing N50 values from different assemblies, the assembly sizes must be the same size in order for N50 to be meaningful.

L50

Given a set of contigs, each with its own length, the L50 count is defined as the smallest number of contigs whose length sum produces N50. From the example above the L50=3.

N90

The N90 statistic is less than or equal to the N50 statistic; it is the length for which the collection of all contigs of that length or longer contains at least 90% of the sum of the lengths of all contigs, and for which the collection of all contigs of that length or shorter contains at least 10% of the sum of the lengths of all contigs.

NG50

Note that N50 is calculated in the context of the assembly size rather than the genome size. Therefore, comparisons of N50 values derived from assemblies of significantly different lengths are usually not informative, even if for the same genome. To address this, the authors of the Assemblathon competition derived a new measure called NG50. The NG50 statistic is the same as N50 except that it is 50% of the known or estimated genome size that must be of the NG50 length or longer. This allows for meaningful comparisons between different assemblies. In the typical case that the assembly size is not more than the genome size, the NG50 statistic will not be more than the N50 statistic.

D50

The D50 statistic (also termed D50 test) is similar to the N50 statistic in definition though it is generally not used to describe genome assemblies. The D50 statistic is the lowest value d for which the sum of the lengths of the largest d lengths is at least 50% of the sum of all of the lengths.[1]

Examples

Consider two fictional, highly simplified genome assemblies, A and B, that are derived from two different species. Assembly A contains six contigs of lengths 80 kbp, 70 kbp, 50 kbp, 40 kbp, 30 kbp, and 20 kbp. The sum size of assembly A is 290 kbp, the N50 contig length is 70 kbp because 80 + 70 is greater than 50% of 290, and the L50 contig count is 2 contigs. The contig lengths of assembly B are the same as those of assembly A except for the presence of two additional contigs with lengths of 10 kbp and 5 kbp. The size of assembly B is 305 kbp, the N50 contig length drops to 50 kbp because 80 + 70 + 50 is greater than 50% of 305, and the L50 contig count is 3 contigs. This example illustrates that one can sometimes increase the N50 length simply by removing some of the shortest contigs or scaffolds from an assembly.

If the estimated or known size of the genome from the fictional species A is 500 kbp then the NG50 contig length is 30 kbp because 80 + 70 + 50 + 40 + 30 is greater than 50% of 500. In contrast, if the estimated or known size of the genome from species B is 350 kbp then it has an NG50 contig length of 50 kbp because 80 + 70 + 50 is greater than 50% of 350.

Alternate computation

N50 can be found mathematically for a list L of positive integers as follows:

  1. Create another list L' , which is identical to L, except that every element n in L has been replaced with n copies of itself.
  2. The median of L' is the N50 of L. (The 10% quantile of L' is the N90 statistic.)

For example: If L = (2, 2, 2, 3, 3, 4, 8, 8), then L' consists of six 2's, six 3's, four 4's, and sixteen 8's. That is, L' has twice as many 2s as L; it has three times as many 3s as L; it has four times as many 4s; etc. The median of the 32-element set L' is the average of the 16th smallest element, 4, and 17th smallest element, 8, so the N50 is 6. We can see that the sum of all values in the list L that are smaller than or equal to the N50 of 6 is 16 = 2+2+2+3+3+4 and the sum of all values in the list L that are larger than or equal to 6 is also 16 = 8+8. For comparison with the N50 of 6, note that the mean of the list L is 4 while the median is 3.

Contradictory definitions

Some contradictions in the definition(s) of the N50 value have been identified, as discussed in a thread on the SEQ Answers forum. Also see

References

  1. Han, J.; Sanders, C. M.; Wang, C.; Yang, Q.; Wimbish, J.; Boone, B. E.; Thomas, S. J.; Levy, S.E. (25 September 2012). Measurement of T cell repertoire diversity in the peripheral blood by novel multiplex PCR and high-performance sequencing methods. MipTec. Basel Switzerland.

See also

This article is issued from Wikipedia - version of the 10/21/2016. The text is available under the Creative Commons Attribution/Share Alike but additional terms may apply for the media files.