Automated search of natively folded protein fragments for high-throughput structure determination in structural genomics

YUTAKA KURODA; KAZUTOSHI TANI; YO MATSUO; SHIGEYUKI YOKOYAMA

doi:10.1110/ps.9.12.2313

Abstract

Structural genomics projects envision almost routine protein structure determinations, which are currently imaginable only for small proteins with molecular weights below 25,000 Da. For larger proteins, structural insight can be obtained by breaking them into small segments of amino acid sequence that can fold into native structures even when isolated from the rest of the protein. Such segments are autonomously folding units (AFUs) and have sizes suitable for fast structural analyses. Here, we propose to expand an intuitive procedure often employed for identifying biologically important domains into an automatic method for detecting putative folded protein fragments. The procedure is based on the recognition that large proteins can be regarded as a combination of independent domains conserved among diverse organisms. We have thus developed a program that reorganizes the output of BLAST searches and detects regions with a large number of similar sequences. To automate the detection process, it is reduced to a simple geometrical problem of recognizing rectangular-shaped elevations in a graph that plots the number of similar sequences at each residue of a query sequence. We used our program to quantitatively corroborate the premise that segments with conserved sequences correspond to domains that fold into native structures. We applied our program to a test data set composed of 99 amino acid sequences containing 150 segments with structures listed in the Protein Data Bank, and thus known to fold into native structures. Overall, the fragments identified by our program have an almost 50% probability of forming a native structure, and comparable results are observed with sequences containing domain linkers classified in SCOP. Furthermore, we verified that our program identifies AFUs in libraries from various organisms, and we found a significant number of AFU candidates for structural analysis, covering an estimated 5 to 20% of the genomic databases. Altogether, these results argue that methods based on sequence similarity can be useful for dissecting large proteins into small autonomously folding domains, and such methods may provide efficient support for structural genomics projects.
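The detection step described in the abstract (a per-residue count of similar sequences, followed by a search for rectangular elevations) can be illustrated with a minimal Python sketch. This is not the authors' program: the function names `coverage_profile` and `find_elevations`, the plain threshold test used in place of a true shape-recognition step, and the default values of `min_hits` and `min_length` are all assumptions made for illustration. Hits are assumed to be given as 1-based inclusive (start, end) ranges on the query, as might be extracted from BLAST alignments.

```python
from typing import List, Tuple


def coverage_profile(query_length: int, hits: List[Tuple[int, int]]) -> List[int]:
    """Count how many hits cover each residue of the query.

    `hits` holds 1-based inclusive (start, end) ranges on the query.
    The returned list has one entry per residue (index 0 is residue 1).
    """
    # Difference array: +1 where a hit begins, -1 just after it ends.
    diff = [0] * (query_length + 2)
    for start, end in hits:
        start = max(1, start)
        end = min(query_length, end)
        if start > end:
            continue  # hit lies entirely outside the query
        diff[start] += 1
        diff[end + 1] -= 1

    profile = []
    running = 0
    for pos in range(1, query_length + 1):
        running += diff[pos]
        profile.append(running)
    return profile


def find_elevations(
    profile: List[int],
    min_hits: int = 5,      # assumed threshold, not taken from the paper
    min_length: int = 40,   # assumed minimum segment length, not from the paper
) -> List[Tuple[int, int]]:
    """Return 1-based inclusive segments where coverage stays >= min_hits
    for at least min_length consecutive residues."""
    segments = []
    start = None
    for pos, count in enumerate(profile, start=1):
        if count >= min_hits:
            if start is None:
                start = pos
        elif start is not None:
            if pos - start >= min_length:
                segments.append((start, pos - 1))
            start = None
    # Close a segment that runs to the end of the query.
    if start is not None and len(profile) - start + 1 >= min_length:
        segments.append((start, len(profile)))
    return segments


if __name__ == "__main__":
    # Hypothetical hit ranges: six hits on residues 20-100, two on 120-180.
    example_hits = [(20, 100)] * 6 + [(120, 180)] * 2
    example_profile = coverage_profile(200, example_hits)
    print(find_elevations(example_profile))
```

With the hypothetical hits above, only the region covered by six hits passes the assumed threshold, so the sparsely covered region is not reported as a candidate fragment. A fuller treatment would also have to handle sloped edges and nested plateaus, which a single threshold does not capture.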

