Laura F. Campitelli1,2*, Isaac Yellan1,2*, Mihai Albu2, Marjan Barazandeh2,3, Zain M. Patel1,2, Mathieu Blanchette4, and Timothy R. Hughes1,2§
1Department of Molecular Genetics, University of Toronto, Toronto, ON
2Donnelly Centre, University of Toronto, Toronto, ON
3Faculty of Pharmaceutical Sciences, University of British Columbia, Vancouver, BC
4Department of Computer Science, McGill University, Montreal, QC
§To whom correspondence should be addressed:
Abstract
Sequences derived from the LINE-1 (L1) family of retrotransposons occupy at least 17% of the human genome, with 67 distinct subfamilies representing successive waves of expansion and extinction in mammalian lineages. L1s contribute extensively to gene regulation, but their molecular history is difficult to trace, because most are present only as truncated and highly mutated fossils. Consequently, L1 entries in current databases of repeat sequences are composed mainly of short diagnostic subsequences, rather than full functional progenitor sequences for each subfamily. Here, we have coupled two levels of sequence reconstruction (at the level of whole genomes, and L1 subfamilies) to reconstruct progenitor sequences for all human L1 subfamilies that are more functionally and phylogenetically plausible than existing models. Most of the reconstructed sequences are at or near the canonical length of L1s and encode uninterrupted ORFs with expected protein domains. We also show that the presence or absence of binding sites for KZFPs (KRAB-containing Zinc Finger Proteins), even in ancient reconstructed progenitor L1s, mirrors binding observed in human ChIP-exo experiments, thus extending the arms race and domestication model. RepeatMasker searches of the modern human genome suggest that the new models may be able to assign subfamily-resolution identities to previously ambiguous L1 instances. The reconstructed L1 sequences will be useful for genome annotation and functional study of both L1 evolution and L1 contributions to host regulatory networks.
Web Supplementary Files
Data Underlying Figures
- RepeatMasker scans of hg38 and ancestral genomes. (.zip 1.8Gb)
- Figure 4A.
- Source alignment of 54 composite sequences. (.fa)
- Tree produced using the alignment and FastTree. (.tree)
- Figure 4B.
- Source alignment of 67 Dfam L1 subfamily 3’ end models. (.aln)
- Tree produced using the alignment. (.tree)
- Figure 5.
- KZFP-TE enrichment p-values (from Barazandeh et al 2018). (.xlsx)
- KZFP-TE top 500 peak overlap (from Barazandeh et al 2018). (.xlsx)
- Figure 6.
- RepeatMasker .out file for the Composite Sequence custom library queried against hg38. (.gz)
- Figure S2.
- RepeatMasker scan .out file of hg38 (CG corrected Kimura Divergence values are in last column). (.zip)
- RepeatMasker scan .out file of the Progressive Cactus eutherian ancestral genome (CG corrected Kimura Divergence values are in last column). (.zip)
- RepeatMasker scan .out file of the Ancestors 1.1 eutherian ancestral genome (CG corrected Kimura Divergence values are in last column). (.zip)
- Figure S5.
- RepeatMasker scan .out files for Progressive Cactus simian and primate reconstructed ancestral genomes. (.zip)
- S5A: FASTA files containing Cactus genome-derived reconstructed sequences equivalent to the L1MA2, L1MA4, and L1MD1-3 best full-length sequences. (.zip)
- S5B: FASTA files containing Muscle alignments of Cactus genome-derived full-length reconstruction input sequences. (.zip)
- Figure S6.
- S6A: Results of Conserved Domain scans of Cactus genome-derived full-length reconstructed sequences. (.txt)
- S6B-D: Character posterior probabilities of “best” full-length reconstructed sequences. (.zip)
- Figure S7.
- S7B-C:
- Results of Conserved Domain scans of translated initial full-length reconstructed sequences. (.txt)
- Results of Conserved Domain scans of translated reconstructed ORFs. (.csv)
- Figure S15.
- S15A:
- Source alignment of 67 composite sequences. (.afa)
- Tree produced using the alignment. (.tree)
- S15B-E:
- Source Muscle alignments for phylogenetic trees of reconstructed sequence components.
- Trees produced using above alignments.
- Figure S17.
- Unfiltered BLAST results of Composite Sequences queried against hg38. (.zip)
- BED file of L1 instances annotated using BLAST pipeline. (.bed)
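
Several of the files listed above are RepeatMasker .out scans in which the CG-corrected Kimura divergence is stored in the last column. As a minimal sketch of how such a file might be read, the Python below assumes the standard whitespace-delimited RepeatMasker .out layout, with the divergence value appended as the final field of each hit line. The names `RepeatHit` and `parse_out_lines` are hypothetical and are not part of the supplementary material.

```python
from typing import Iterable, Iterator, NamedTuple


class RepeatHit(NamedTuple):
    query: str     # sequence (e.g. chromosome) containing the hit
    start: int     # start position in the query, as reported by RepeatMasker
    end: int       # end position in the query
    strand: str    # "+" or "C" (complement)
    repeat: str    # name of the matching repeat model
    kimura: float  # ASSUMPTION: the last column holds the Kimura divergence


def parse_out_lines(lines: Iterable[str]) -> Iterator[RepeatHit]:
    """Yield one RepeatHit per hit line, skipping headers and blank lines."""
    for line in lines:
        fields = line.split()
        # Hit lines start with an integer score; header lines do not.
        if len(fields) < 15 or not fields[0].isdigit():
            continue
        try:
            kimura = float(fields[-1])
        except ValueError:
            continue  # last field is not numeric, so skip rather than guess
        yield RepeatHit(
            query=fields[4],
            start=int(fields[5]),
            end=int(fields[6]),
            strand=fields[8],
            repeat=fields[9],
            kimura=kimura,
        )
```

Passing an open file handle (for example, one obtained from `gzip.open(path, "rt")` for a compressed scan) to `parse_out_lines` streams hits one at a time, which keeps memory use low for genome-scale scans.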