This track displays
polytracts in the human reference genome.
A polytract is defined as a tract of tandem mono-nucleotide, di-nucleotide, or
tri-nucleotide repeats, termed MNR, DNR, and TNR, respectively. Each MNR spans
at least six units, and each DNR or TNR spans at least three units. An incomplete
terminal motif is included in a polytract, so the length of a DNR or TNR
is not necessarily a multiple of 2 or 3.
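The definition can be expressed as a regular expression. The following is a minimal sketch in Python, not the R workflow used to build this track. It assumes minimums of 6 units for MNRs and 3 units for DNRs and TNRs, and it only models an incomplete unit at the trailing end of a tract. The names `MIN_UNITS`, `polytract_pattern`, and `find_polytracts` are hypothetical.

```python
import re

# Minimum number of complete repeat units per motif length
# (assumption: 6 for MNR, 3 for DNR and TNR).
MIN_UNITS = {1: 6, 2: 3, 3: 3}


def polytract_pattern(motif):
    """Compile a regex for tandem copies of `motif` plus an optional partial copy."""
    k = len(motif)
    # Proper prefixes of the motif, longest first, model an incomplete trailing unit.
    prefixes = [motif[:i] for i in range(k - 1, 0, -1)]
    tail = "(?:" + "|".join(prefixes) + ")?" if prefixes else ""
    return re.compile("(?:%s){%d,}%s" % (motif, MIN_UNITS[k], tail))


def find_polytracts(seq, motif):
    """Return (start, end, tract) tuples; coordinates are 0-based, end-exclusive."""
    return [(m.start(), m.end(), m.group())
            for m in polytract_pattern(motif).finditer(seq)]


# Toy sequence: three CAG units, a partial "CA", then a run of seven T.
seq = "GG" + "CAG" * 3 + "CA" + "T" * 7 + "G"
print(find_polytracts(seq, "CAG"))
print(find_polytracts(seq, "T"))
```

In the toy sequence the CAG tract is 11 nt long, which illustrates why a TNR length need not be a multiple of 3.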
Four shades of gray distinguish the four kinds of data shown:
the three classes of polytracts (MNR, DNR, and TNR) and the polytract
hinge.
A polytract is defined as a tract of tandemly repeated mono-nucleotide, di-nucleotide, or tri-nucleotide motifs, possibly including an incomplete terminal motif, where the minimum number of repeat units is 6 for MNRs and 3 for DNRs and TNRs. Polytracts were identified by string matching between a pattern and each chromosome sequence, with assistance from the R packages “stringr” (https://CRAN.R-project.org/package=stringr) and “BSgenome.Hsapiens.UCSC.hg38” (www.bioconductor.org; DOI: 10.18129/B9.bioc.BSgenome). To accommodate the complementarity between paired purines and pyrimidines (A:T and G:C), MNRs comprise the A/T and C/G species, DNRs comprise the TA, CT/GA, CA/GT, and GC species, and TNRs comprise the AAC, AAG, AAT, ACC, GAC, ACT, CAG, AGG, ATC, and CGG species. That is, the resulting polytract dataset consists of two types of MNRs, four types of DNRs, and ten types of TNRs (Table 1).
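The species counts (two, four, and ten) can be reproduced by grouping motifs that are equivalent on the opposite strand. The sketch below is in Python and is illustrative only. It assumes that rotations of a motif (for example CT and TC) are also treated as the same species, which the text does not state explicitly. The helper names `canonical` and `species` are hypothetical, and the alphabetically smallest representative it picks may differ from the labels in Table 1 (for example, it reports AGC for the CAG species).

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(motif):
    """Smallest string among all rotations of the motif and of its reverse complement."""
    revcomp = motif.translate(COMPLEMENT)[::-1]
    rotations = [s[i:] + s[:i] for s in (motif, revcomp) for i in range(len(s))]
    return min(rotations)


def species(k):
    """Distinct motif classes of length k, excluding single-base runs when k > 1."""
    motifs = ("".join(p) for p in product("ACGT", repeat=k))
    return sorted({canonical(m) for m in motifs if k == 1 or len(set(m)) > 1})


for k in (1, 2, 3):
    print(k, len(species(k)), species(k))
```

Motifs made of a single repeated base are skipped for lengths 2 and 3 because such tracts are already MNRs.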
Table 1. Summary statistics of the three classes of polytracts in the human reference genome (hg38).
| Polytract species | Tract number | Tract volume (nt) | Genome occupancy | Mean length (nt) | Median length (nt) |
|---|---|---|---|---|---|
| A/T | 7,119,220 | 55,290,931 | 1.79% | 7.8 | 6 |
| C/G | 610,474 | 3,839,875 | 0.12% | 6.3 | 6 |
| MNR total | 7,729,694 | 59,130,806 | 1.91% | 7.6 | 6 |
| TA | 2,764,278 | 19,948,282 | 0.65% | 7.2 | 6 |
| CT/GA | 3,679,922 | 25,030,351 | 0.81% | 6.8 | 6 |
| CA/GT | 3,371,036 | 25,125,048 | 0.81% | 7.5 | 6 |
| GC | 71,160 | 476,554 | 0.02% | 6.7 | 6 |
| DNR total | 9,886,396 | 70,580,235 | 2.29% | 7.1 | 6 |
| AAT | 353,551 | 3,828,512 | 0.12% | 10.8 | 9 |
| ACC | 238,068 | 2,302,369 | 0.07% | 9.7 | 9 |
| AAG | 183,579 | 1,859,914 | 0.06% | 10.1 | 9 |
| AGG | 183,232 | 1,856,718 | 0.06% | 10.1 | 9 |
| AAC | 146,444 | 1,720,479 | 0.06% | 11.7 | 10 |
| CAG | 131,235 | 1,290,213 | 0.04% | 9.8 | 9 |
| ATC | 123,051 | 1,223,622 | 0.04% | 9.9 | 9 |
| ACT | 36,083 | 352,774 | 0.01% | 9.8 | 9 |
| CGG | 21,508 | 238,039 | 0.01% | 11.1 | 10 |
| GAC | 1,396 | 13,915 | 0.00% | 10.0 | 9 |
| TNR total | 1,418,147 | 14,686,555 | 0.48% | 10.4 | 9 |
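The columns of Table 1 are related by simple arithmetic: each class total is the sum of its species rows, and mean length is tract volume divided by tract number. The Python sketch below checks both relations using values copied from the table. The names `TABLE`, `mean_length`, and `check_class` are hypothetical, and genome occupancy is left out because the genome length used as its denominator is not stated here.

```python
# Values copied from Table 1: (tract number, tract volume in nt, reported mean length).
TABLE = {
    "MNR": ({"A/T": (7119220, 55290931, 7.8),
             "C/G": (610474, 3839875, 6.3)},
            (7729694, 59130806, 7.6)),
    "DNR": ({"TA": (2764278, 19948282, 7.2),
             "CT/GA": (3679922, 25030351, 6.8),
             "CA/GT": (3371036, 25125048, 7.5),
             "GC": (71160, 476554, 6.7)},
            (9886396, 70580235, 7.1)),
    "TNR": ({"AAT": (353551, 3828512, 10.8),
             "ACC": (238068, 2302369, 9.7),
             "AAG": (183579, 1859914, 10.1),
             "AGG": (183232, 1856718, 10.1),
             "AAC": (146444, 1720479, 11.7),
             "CAG": (131235, 1290213, 9.8),
             "ATC": (123051, 1223622, 9.9),
             "ACT": (36083, 352774, 9.8),
             "CGG": (21508, 238039, 11.1),
             "GAC": (1396, 13915, 10.0)},
            (1418147, 14686555, 10.4)),
}


def mean_length(number, volume):
    """Mean tract length in nt, rounded to one decimal as in the table."""
    return round(volume / number, 1)


def check_class(name):
    """True if the class total equals the sum of its rows and every mean matches."""
    rows, total = TABLE[name]
    sums_ok = (sum(r[0] for r in rows.values()) == total[0]
               and sum(r[1] for r in rows.values()) == total[1])
    means_ok = all(mean_length(n, v) == m
                   for n, v, m in list(rows.values()) + [total])
    return sums_ok and means_ok


for name in TABLE:
    print(name, check_class(name))
```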
Data were generated and processed in the Guo Bioinformatics Lab
at the UNM Comprehensive Cancer Center. For inquiries, please contact Dr. Hui Yu (huiyu1@salud.unm.edu).
Yu H, Zhao S, Ness S, Kang H, Sheng Q, Samuels DC, Oyebamiji O, Guo Y. Non-canonical RNA-DNA differences and other human genomic features are associated with very short tandem repeats. PLoS Comp Biol. 2020. In revision.