
Please use this identifier to cite or link to this item: http://ntour.ntou.edu.tw:8080/ir/handle/987654321/17919

Title: Near-Duplicate Mail Detection Based On URL Information (以URL資訊為基礎的相似郵件偵測系統設計)
Authors: Chia-Hui Lin
Contributors: NTOU:Department of Computer Science and Engineering
Keywords: spam (垃圾信)
Date: 2004
Issue Date: 2011-07-04
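Abstract: Commerce on the Internet is flourishing, and many merchants use email as an advertising medium; unsolicited bulk email sent without the user's consent has reached epidemic proportions. Email addresses are easy to collect, and sending email in bulk costs far less than sending traditional mail while also being immediate. Delivering advertisements by email has therefore become an important marketing technique, and it has also led to a flood of spam. If companies were willing to spend as much money on junk email, an Internet user would probably receive more than several hundred spam messages per day. Whenever users send messages, fill in forms on the web, or even register a product on its website with a registration card, they make it easy for spammers to obtain their email addresses. The flood of spam wastes server and Internet resources and causes considerable trouble for email users; the problem seriously threatens the effective use of Internet resources and of email itself.

In recent years, many researchers have recognized the importance of anti-spam techniques, and many such techniques have been proposed. Most of them judge each message by whether its content contains unwanted advertising. In this thesis, we instead use the similarity between messages to detect near-duplicate mail, taking the URL information contained in the message content as the feature. Generally, an advertising message is copied in bulk and sent to users on the network, and the messages sent to these different users all derive from the same original. If that were all, duplicate detection would suffice. Spammers, however, have become much better at evading such detection: they can insert random text or personalize each copy according to the recipient, for example by the recipient's name, the recipient's mailbox, or the subject line. These messages look different but in fact derive from the same spam message.

We therefore design a near-duplicate mail detection system based on URL information, in the hope that it can assist spam detection. We also study different duplicate-detection methods, implement them, and compare their performance with that of our system. We use real mail in the experiments to make the results more meaningful. We mainly compare against the Octet-based histogram method and two other recently proposed similar-document detection techniques (I-Mach and Winnowing). The results show that, with the spam we have collected so far as test samples, our method achieves higher accuracy than the other three methods. We also analyze how different spam techniques affect our method. We believe these results will be of considerable help in designing an effective spam detection system in the future.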
As the commercialization of the Internet continues, unsolicited email has reached epidemic proportions as more and more marketers turn to email as an advertising medium [27]. The usefulness of email is seriously threatened because it is easier to collect address lists and cheaper to mass-distribute messages by email than by traditional mail. If companies spent as much money on sending junk email as they do on sending junk physical mail, an Internet user would likely receive hundreds of junk messages per day. Every time an Internet user exposes his/her email address to the public, for example by listing it on or registering at a web site, spammers can easily obtain it. Recently, many researchers have become aware of the importance of developing techniques to combat spam. Various anti-spam techniques have been proposed, and most of them are based on intra-mail scanning, which judges each message by its own content. In this thesis, we present an inter-mail scanning scheme for spam detection based on the URL information in mail content. Usually, a spam message is massively reproduced and delivered to Internet users. Many copies of a spam message delivered to different receivers are identical, and they can easily be detected by exactly comparing their fingerprints. However, more and more intelligent spam delivery systems are able to generate customized copies (according to, for example, the receiver's name, email address, or the email subject) for different receivers. The contents of these customized copies are not exactly the same, but they deliver the same message to the receivers. In this thesis, a near-duplicate mail detection system is developed based on URL information. Extensive empirical results are reported, using real mail for training and testing under different spam behaviors. To better understand the strength of the proposed scheme, different near-duplicate (mail/document) detection approaches are also investigated and compared in terms of accuracy.
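The contrast between exact fingerprint matching and URL-based matching can be illustrated with a minimal Python sketch. It is not the system developed in the thesis, whose details are not given in this abstract: the URL pattern, the normalization in `url_features`, the Jaccard measure, the 0.5 threshold, and the sample messages are all assumptions chosen for illustration.

```python
import hashlib
import re

# Hypothetical pattern: http/https URLs up to whitespace or a bracket.
URL_PATTERN = re.compile(r"https?://[^\s<>()]+", re.IGNORECASE)


def exact_fingerprint(body: str) -> str:
    """Digest of the whole body; any single-character change alters it."""
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def url_features(body: str) -> frozenset:
    """URLs found in the body, lowercased, with trailing punctuation removed.

    Lowercasing the whole URL is a simplification; paths can be case-sensitive.
    """
    return frozenset(u.rstrip(".,;:!?").lower() for u in URL_PATTERN.findall(body))


def url_similarity(a: str, b: str) -> float:
    """Jaccard similarity of the two URL sets (0.0 when neither has a URL)."""
    ua, ub = url_features(a), url_features(b)
    if not ua and not ub:
        return 0.0
    return len(ua & ub) / len(ua | ub)


def is_near_duplicate(a: str, b: str, threshold: float = 0.5) -> bool:
    """Assumed decision rule: enough shared URLs means the same campaign."""
    return url_similarity(a, b) >= threshold


# Two customized copies of one advertisement, plus an unrelated message.
mail_1 = "Dear Alice, great deals at http://shop.example.com/sale today! xk3q"
mail_2 = "Hello Bob, great deals at http://shop.example.com/sale today! p9zt"
mail_3 = "Meeting notes are at http://intranet.example.org/notes"

print(exact_fingerprint(mail_1) == exact_fingerprint(mail_2))  # False
print(is_near_duplicate(mail_1, mail_2))  # True
print(is_near_duplicate(mail_1, mail_3))  # False
```

The personalized greeting and random suffix defeat the whole-body digest, while the advertised URL, which the sender needs the reader to reach, stays constant across copies.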
We compare three different approaches available in the literature: the Octet-based histogram method, I-Mach, and Winnowing. Using thousands of real mail messages that we collected as test samples, the experimental results show that the proposed strategy outperforms the other three approaches in terms of accuracy. We hope the results of this thesis will offer more insight into spam and anti-spam techniques for the design of spam filtering systems.
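For readers unfamiliar with Winnowing, one of the baselines named above, the following Python sketch shows its core idea: hash every k-character substring, then keep only the minimum hash in each sliding window of w hashes. This is a simplified illustration, not the implementation or configuration evaluated in the thesis; the values of `k` and `w`, the use of `zlib.crc32` instead of a rolling hash, the Jaccard comparison, and the sample messages are assumptions.

```python
import zlib


def kgram_hashes(text: str, k: int) -> list:
    """Hash every k-character substring after removing whitespace and case."""
    cleaned = "".join(text.lower().split())
    return [zlib.crc32(cleaned[i:i + k].encode("utf-8"))
            for i in range(len(cleaned) - k + 1)]


def winnow(text: str, k: int = 5, w: int = 4) -> set:
    """Return (hash, position) fingerprints: the minimum of each window of w hashes."""
    hashes = kgram_hashes(text, k)
    if not hashes:
        return set()
    w = min(w, len(hashes))  # short texts get a single, smaller window
    selected = set()
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        smallest = min(window)
        # Ties go to the rightmost minimum, the rule in the published algorithm.
        offset = w - 1 - window[::-1].index(smallest)
        selected.add((smallest, start + offset))
    return selected


def winnow_similarity(a: str, b: str, k: int = 5, w: int = 4) -> float:
    """Jaccard similarity of the selected hash values, ignoring positions."""
    fa = {h for h, _ in winnow(a, k, w)}
    fb = {h for h, _ in winnow(b, k, w)}
    if not fa and not fb:
        return 0.0
    return len(fa & fb) / len(fa | fb)


# Hypothetical customized copies of the same advertisement.
spam_a = "Dear Alice, buy cheap watches now at our online store"
spam_b = "Dear Bob, buy cheap watches now at our online store"

print(winnow_similarity(spam_a, spam_a))  # 1.0
print(winnow_similarity(spam_a, spam_b))  # between 0 and 1
```

Because every window contributes a fingerprint, any shared run of at least `w + k - 1` characters yields at least one common hash, which is why the two customized copies still overlap.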
URI: http://ethesys.lib.ntou.edu.tw/cdrfb3/record/#G0M92570007
Appears in Collections: [Department of Computer Science and Engineering] Theses and Dissertations


All items in NTOUR are protected by copyright, with all rights reserved.


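Copyright policy: The content of this website is the institutional repository collection of National Taiwan Ocean University, provided free of charge for public-interest uses such as academic research and public education. Please use the content of this website fairly, respecting the rights of the copyright holders.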