Track of human polytracts

Description

This track displays polytracts in the human reference genome. A polytract is a tract of tandemly repeated mono-nucleotide, di-nucleotide, or tri-nucleotide motifs, termed MNR, DNR, and TNR, respectively. Each MNR spans at least six units, and each DNR or TNR spans at least three units. An incomplete terminal motif is included in a polytract, so the length of a DNR or TNR is not necessarily a multiple of 2 or 3.
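The role of the incomplete terminal motif can be illustrated with a regular expression. The following is a minimal sketch in Python, not the pipeline used to build this track; the helper names `polytract_pattern` and `find_polytracts` are hypothetical, and only a single motif on a single strand is searched.

```python
import re

def polytract_pattern(motif, min_units):
    """Regex for at least min_units tandem copies of motif, plus an optional partial copy."""
    core = f"(?:{re.escape(motif)}){{{min_units},}}"
    # Proper prefixes of the motif, longest first (e.g. "CA" then "C" for "CAG").
    prefixes = [re.escape(motif[:k]) for k in range(len(motif) - 1, 0, -1)]
    tail = f"(?:{'|'.join(prefixes)})?" if prefixes else ""
    return re.compile(core + tail)

def find_polytracts(seq, motif, min_units):
    """Return (start, end, tract) tuples with 0-based, end-exclusive coordinates."""
    pattern = polytract_pattern(motif, min_units)
    return [(m.start(), m.end(), m.group()) for m in pattern.finditer(seq)]

# Three full CAG units followed by the partial unit "CA": 11 nt, not a multiple of 3.
hits = find_polytracts("TTGCAGCAGCAGCATTT", "CAG", 3)
```

Here `hits` holds one tract, `CAGCAGCAGCA`, spanning positions 3 to 14. Calling the same helper with motif `"A"` and `min_units=6` covers the MNR case, where no partial unit can exist.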

Display Conventions and Configuration

Four shades of gray distinguish four classes of data: the three clades of polytracts (MNR, DNR, and TNR) and the polytract hinge.

Methods

A polytract is defined as a tract of tandemly repeated mono-nucleotide, di-nucleotide, or tri-nucleotide motifs, with a possibly incomplete terminal motif included, where the minimum number of repeated units is 6 for MNR and 3 for DNR and TNR. Polytracts were identified by string matching of a pattern against each chromosome sequence, with assistance from the R packages “stringr” (https://CRAN.R-project.org/package=stringr) and “BSgenome.Hsapiens.UCSC.hg38” (www.bioconductor.org; DOI: 10.18129/B9.bioc.BSgenome). To accommodate the complementarity between paired purines and pyrimidines (A:T and G:C), motifs that are equivalent on opposite strands are merged into one species: MNRs comprise the A/T and G/C species; DNRs comprise the TA, CT/GA, CA/GT, and GC species; and TNRs comprise the AAC, AAG, AAT, ACC, GAC, ACT, CAG, AGG, ATC, and CGG species. The resulting polytract dataset thus consists of two species of MNRs, four species of DNRs, and ten species of TNRs (Table 1).
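The species counts of two, four, and ten can be reproduced by grouping motifs that differ only by strand (reverse complement) or by the phase of the repeat (rotation). The sketch below is in Python rather than the R packages named above, and it assumes that rotation and reverse-complement equivalence is the grouping rule. The function names are hypothetical, and the canonical label chosen here (the alphabetically smallest member) can differ from the label used in this track, for example ACG here versus GAC in Table 1.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(motif):
    return motif.translate(COMPLEMENT)[::-1]

def rotations(motif):
    """All cyclic shifts of the motif, e.g. CAG -> {CAG, AGC, GCA}."""
    return {motif[i:] + motif[:i] for i in range(len(motif))}

def species(motif):
    """Canonical label: alphabetically smallest rotation on either strand."""
    return min(rotations(motif) | rotations(reverse_complement(motif)))

def list_species(k):
    """Distinct species for motif length k (valid for k = 1, 2, 3)."""
    motifs = ("".join(p) for p in product("ACGT", repeat=k))
    # For k > 1, skip single-letter motifs such as "AA" or "CCC"; they are MNR units.
    return sorted({species(m) for m in motifs if k == 1 or len(set(m)) > 1})

mnr_species = list_species(1)
dnr_species = list_species(2)
tnr_species = list_species(3)
```

Under this assumption the three lists contain 2, 4, and 10 entries, and `species("GAC")`, `species("CAG")`, and `species("CGG")` resolve to the same classes as ACG, AGC, and CCG, respectively.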

Table 1. Summary statistics of three clades of polytracts in the human reference genome (HG38).

Polytract species  Tract number  Tract volume (nt)  Genome occupancy  Mean length (nt)  Median length (nt)
A/T                   7,119,220         55,290,931             1.79%               7.8                   6
C/G                     610,474          3,839,875             0.12%               6.3                   6
MNR total             7,729,694         59,130,806             1.91%               7.6                   6
TA                    2,764,278         19,948,282             0.65%               7.2                   6
CT/GA                 3,679,922         25,030,351             0.81%               6.8                   6
CA/GT                 3,371,036         25,125,048             0.81%               7.5                   6
GC                       71,160            476,554             0.02%               6.7                   6
DNR total             9,886,396         70,580,235             2.29%               7.1                   6
AAT                     353,551          3,828,512             0.12%              10.8                   9
ACC                     238,068          2,302,369             0.07%               9.7                   9
AAG                     183,579          1,859,914             0.06%              10.1                   9
AGG                     183,232          1,856,718             0.06%              10.1                   9
AAC                     146,444          1,720,479             0.06%              11.7                  10
CAG                     131,235          1,290,213             0.04%               9.8                   9
ATC                     123,051          1,223,622             0.04%               9.9                   9
ACT                      36,083            352,774             0.01%               9.8                   9
CGG                      21,508            238,039             0.01%              11.1                  10
GAC                       1,396             13,915             0.00%              10.0                   9
TNR total             1,418,147         14,686,555             0.48%              10.4                   9
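The columns of Table 1 are related by simple arithmetic: the mean length is the tract volume divided by the tract number, and each total row is the sum of its species rows. The following Python sketch checks those relations using values copied from Table 1; the names `TABLE`, `TOTALS`, and `check_table` are hypothetical. Genome occupancy is not recomputed because the genome length used as the denominator is not stated here.

```python
# (tract number, tract volume in nt, reported mean length in nt), copied from Table 1
TABLE = {
    "MNR": {"A/T": (7_119_220, 55_290_931, 7.8), "C/G": (610_474, 3_839_875, 6.3)},
    "DNR": {"TA": (2_764_278, 19_948_282, 7.2), "CT/GA": (3_679_922, 25_030_351, 6.8),
            "CA/GT": (3_371_036, 25_125_048, 7.5), "GC": (71_160, 476_554, 6.7)},
    "TNR": {"AAT": (353_551, 3_828_512, 10.8), "ACC": (238_068, 2_302_369, 9.7),
            "AAG": (183_579, 1_859_914, 10.1), "AGG": (183_232, 1_856_718, 10.1),
            "AAC": (146_444, 1_720_479, 11.7), "CAG": (131_235, 1_290_213, 9.8),
            "ATC": (123_051, 1_223_622, 9.9), "ACT": (36_083, 352_774, 9.8),
            "CGG": (21_508, 238_039, 11.1), "GAC": (1_396, 13_915, 10.0)},
}
TOTALS = {"MNR": (7_729_694, 59_130_806, 7.6),
          "DNR": (9_886_396, 70_580_235, 7.1),
          "TNR": (1_418_147, 14_686_555, 10.4)}

def mean_length(number, volume):
    """Mean tract length in nt, rounded to one decimal as in Table 1."""
    return round(volume / number, 1)

def check_table():
    """Return a list of discrepancies between reported and recomputed values."""
    problems = []
    for clade, rows in TABLE.items():
        for name, (number, volume, reported) in rows.items():
            if mean_length(number, volume) != reported:
                problems.append(f"{name}: mean length mismatch")
        total_number, total_volume, total_mean = TOTALS[clade]
        if sum(r[0] for r in rows.values()) != total_number:
            problems.append(f"{clade}: tract number total mismatch")
        if sum(r[1] for r in rows.values()) != total_volume:
            problems.append(f"{clade}: tract volume total mismatch")
        if mean_length(total_number, total_volume) != total_mean:
            problems.append(f"{clade}: total mean length mismatch")
    return problems

problems = check_table()
```

An empty `problems` list indicates that the reported means and totals agree with the tract numbers and volumes.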

 

Credits

Data were generated and processed in the Guo Bioinformatics Lab at the UNM Comprehensive Cancer Center. For inquiries, please contact Dr. Hui Yu (huiyu1@salud.unm.edu).

References

Yu H, Zhao S, Ness S, Kang H, Sheng Q, Samuels DC, Oyebamiji O, Guo Y. Non-canonical RNA-DNA differences and other human genomic features are associated with very short tandem repeats. PLoS Comp Biol. 2020. In revision.