Promoter Sequences Prediction Based on Genetic Algorithms and Markov chain
|Authors: ||Yi-Jen Chen|
|Contributors: ||NTOU:Department of Communications Navigation and Control Engineering|
|Keywords: ||Bioinformatics; Promoter prediction; Score method; Markov chain; Real-valued genetic algorithms|
|Issue Date: ||2011-07-04|
|Abstract: ||With the completion of the Human Genome Project (HGP, 2001) and the International Rice Genome Sequencing Project (IRGSP, 2002), nucleic acid and protein sequence data are accumulating at an exponential rate, and more and more related databases are being built. Maintaining and annotating these data effectively is an important problem. Whether a DNA sequence is transcribed can be verified by biological experiments, but such experiments are usually time-consuming and costly. Designing computer algorithms and software to analyze and annotate gene sequences has therefore become one of the most important challenges today. Promoters are the transcription signals that regulate gene expression and control the transcription of DNA into RNA. By studying promoters, we can predict not only whether a DNA sequence will be transcribed into RNA but also which specific RNA it will be transcribed into. The goal of this thesis is to develop promoter sequence prediction algorithms with a high recognition rate. First, we propose simple, fast, and highly sensitive score models that use the frequency of each nucleotide at each sequence position to build a promoter prediction model; these cover both the case in which the transcriptional start site (TSS) is aligned and the case in which it is not. Next, to improve prediction accuracy, we develop optimized prediction methods that combine the score method with genetic algorithms. Furthermore, using Markov chains and the Kullback-Leibler distance, we derive probability models that identify promoter sequences with high accuracy at low computational cost. Finally, we combine the Markov model with a real-valued genetic algorithm to construct a powerful promoter prediction strategy. The Escherichia coli (E. coli) promoter data from the UCI Machine Learning Repository serve as the validation dataset, and the proposed algorithms are evaluated by both self-recognition and cross-validation. Simulation results show that the proposed methods substantially improve promoter prediction accuracy.|
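As a rough illustration of the score-model idea in the abstract (scoring a sequence from the frequency of each nucleotide at each position), the sketch below builds per-position frequency tables for TSS-aligned sequences and scores a candidate by a log-odds sum. This is a generic position-specific scoring scheme, not the thesis's actual score method: the function names, the pseudocount, the zero threshold, and the toy sequences are all assumptions made for the example.

```python
import math
from collections import Counter

ALPHABET = "ACGT"


def position_frequencies(seqs, pseudocount=1.0):
    """Per-position nucleotide frequencies for equal-length, aligned sequences.

    The pseudocount (an assumption, not from the thesis) avoids zero
    frequencies, which would make the log-odds score undefined.
    """
    length = len(seqs[0])
    total = len(seqs) + pseudocount * len(ALPHABET)
    table = []
    for i in range(length):
        counts = Counter(s[i] for s in seqs)
        table.append({b: (counts[b] + pseudocount) / total for b in ALPHABET})
    return table


def log_odds_score(seq, promoter_freqs, background_freqs):
    """Sum over positions of log(P_promoter(base) / P_background(base))."""
    return sum(
        math.log(p[b] / q[b])
        for b, p, q in zip(seq, promoter_freqs, background_freqs)
    )


def is_promoter(seq, promoter_freqs, background_freqs, threshold=0.0):
    """Classify by comparing the score with a (hypothetical) threshold."""
    return log_odds_score(seq, promoter_freqs, background_freqs) > threshold


# Toy, made-up training data; real experiments would use the E. coli set.
promoters = ["TATAAT", "TATAAT", "TATGAT", "TACAAT"]
non_promoters = ["GCGCGC", "CCGGCG", "GCCGGC", "CGCGCC"]

promoter_freqs = position_frequencies(promoters)
background_freqs = position_frequencies(non_promoters)

print(is_promoter("TATAAT", promoter_freqs, background_freqs))
print(is_promoter("GCGCGC", promoter_freqs, background_freqs))
```

In a sketch like this, the decision threshold is the natural target for tuning; an optimizer such as a genetic algorithm could search over it, though how the thesis parameterizes that search is not stated in this record.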
|Appears in Collections:||[Department of Communications, Navigation and Control Engineering] Master's and Doctoral Theses|
All items in NTOUR are protected by copyright, with all rights reserved.