
Please use this identifier to cite or link to this item: http://ntour.ntou.edu.tw:8080/ir/handle/987654321/35811

Title: Interdomain Boundary Detection by Sequence Alignment and Support Vector Machine (基於序列排比與支持向量機之蛋白質域邊界辨識)
Authors: Cing-Han Yang
Contributors: NTOU:Department of Computer Science and Engineering
Keywords: protein functional domain; domain boundary; amino acid pairs; secondary structure; LIBSVM
Date: 2013
Issue Date: 2013-10-07T02:59:26Z
Abstract: A protein is composed of one or more functional domains, each regarded as a fundamental evolutionary unit. Most proteins contain multiple domains; such multidomain proteins are formed through gene duplication events and are likely the result of selective pressure during evolution. Different domain combinations are involved in protein structural stability, protein-protein interactions, and cell-cycle regulation, and mutations in functional domains can produce an abnormal protein structure and lead to serious disease. The main goal of this study is to combine protein sequence alignment with machine learning to improve interdomain boundary detection. Accurate boundary detection can provide useful information for studies of protein structural stability, functional annotation, protein domain interaction, and evolutionary biology. In this study, we collected a comprehensive set of protein sequences and their corresponding domain annotations as training data. Amino acid pair composition, interdomain boundary length, and the distribution of secondary structure elements in boundary regions were analyzed statistically and used as training features for automatically locating protein domain boundaries. We first established a protein domain sequence database for sequence alignment. The known domain sequences are derived from the Pfam database: from the expert-curated Pfam-A dataset, representative sequences of 12,273 different domains were selected. If the domain locations in a query protein cannot be detected through sequence alignment in the first stage, a secondary structure prediction tool is applied as an alternative approach. Integrating the predicted secondary structure with the statistical characteristics of known domains, occurrence frequencies of amino acid pairs, and length distributions of domain boundaries, we employed a support vector machine classifier to identify protein domain boundaries.
The proposed system achieved a precision rate of 86% on a test set of 1,868 proteins, showing that it can automatically detect interdomain boundaries in an unknown protein sequence. The system offers biologists a practical analysis reference before they design biological experiments.
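The abstract does not specify how the amino acid pair statistics are encoded for the classifier, so the following is only a minimal sketch of one plausible encoding, not the thesis's actual method. It assumes adjacent-residue pairs (dipeptides), a log-odds score comparing boundary and domain regions, and a fixed sliding window; the function names, the pseudocount, and the toy sequences are all hypothetical. The resulting vectors could then be passed to an SVM implementation such as LIBSVM.

```python
from collections import Counter
from math import log


def pair_counts(segments):
    """Count adjacent amino acid pairs (dipeptides) over a list of sequences."""
    counts = Counter()
    for seg in segments:
        for i in range(len(seg) - 1):
            counts[seg[i:i + 2]] += 1
    return counts


def pair_log_odds(boundary_segments, domain_segments, pseudocount=1.0):
    """Log-odds of each observed pair appearing in boundary vs. domain regions.

    The pseudocount (an assumption) keeps unseen pairs from producing log(0).
    """
    b = pair_counts(boundary_segments)
    d = pair_counts(domain_segments)
    pairs = set(b) | set(d)
    b_total = sum(b.values()) + pseudocount * len(pairs)
    d_total = sum(d.values()) + pseudocount * len(pairs)
    return {
        p: log(((b[p] + pseudocount) / b_total) / ((d[p] + pseudocount) / d_total))
        for p in pairs
    }


def window_features(sequence, center, half_width, scores):
    """Feature vector for one residue: mean pair score in the window, window length."""
    start = max(0, center - half_width)
    end = min(len(sequence), center + half_width + 1)
    window = sequence[start:end]
    pair_scores = [scores.get(window[i:i + 2], 0.0) for i in range(len(window) - 1)]
    mean_score = sum(pair_scores) / len(pair_scores) if pair_scores else 0.0
    return [mean_score, float(len(window))]


# Toy, made-up segments purely for illustration (not data from the thesis).
toy_scores = pair_log_odds(["GSGS"], ["AVAV"])
toy_features = window_features("AAGSGSAA", center=3, half_width=2, scores=toy_scores)
```

A positive mean score marks a window whose pairs are more typical of the boundary examples than of the domain examples. In a fuller pipeline, secondary structure predictions and boundary length statistics would be appended to the same vector before training.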
URI: http://ethesys.lib.ntou.edu.tw/cdrfb3/record/#G0019957012
Appears in Collections: [Department of Computer Science and Engineering] Theses and Dissertations


All items in NTOUR are protected by copyright, with all rights reserved.

