Privacy-Preserving Ensemble Methods for Email Spam Detection
Abstract
Email is a private medium of communication, and its inherent privacy constraints are a major obstacle to developing effective spam filtering methods, which require access to large amounts of email data belonging to multiple users. To mitigate this problem, we envision a privacy-preserving spam filtering system in which the server can train and evaluate a logistic regression-based spam classifier on the combined email data of all users without being able to observe any emails. This is achieved using primitives such as homomorphic encryption and randomization. We analyze the protocols for correctness and security and perform experiments with a prototype system on a large-scale spam filtering task. State-of-the-art spam filters often use character n-grams as features, resulting in large, sparse data representations that are not feasible to use directly with our training and evaluation protocols. We explore various data-independent dimensionality reduction techniques to decrease the running time of the protocols, making them feasible to use in practice while still achieving high accuracy.
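As an illustration of what a data-independent reduction of character n-gram features can look like, the sketch below uses feature hashing, where each n-gram is mapped to one of a fixed number of buckets by a hash function chosen without looking at any email data. The abstract does not name the specific techniques explored, so feature hashing is an assumption here, and the function names, the trigram length, and the bucket count are all hypothetical.

```python
import hashlib
from collections import Counter


def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of the given text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def hashed_features(text, n=3, num_buckets=1024):
    """Map character n-gram counts into a fixed-length vector by hashing.

    Assumption: feature hashing stands in for the unspecified
    data-independent reduction. The mapping depends only on the hash
    function, never on the emails themselves.
    """
    vec = [0] * num_buckets
    for gram, count in Counter(char_ngrams(text, n)).items():
        # Use a stable hash; Python's built-in hash() is salted per process.
        digest = hashlib.sha256(gram.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:8], "big") % num_buckets
        vec[bucket] += count
    return vec


# Hypothetical usage: a short message reduced to a 64-dimensional vector.
features = hashed_features("cheap meds, buy now", n=3, num_buckets=64)
print(len(features), sum(features))
```

Because the bucket assignment is fixed in advance, every user could compute the same reduced representation locally before any encryption takes place, which is what makes such a reduction compatible with a setting where the server never sees the raw emails. Distinct n-grams may collide in the same bucket, so the bucket count trades accuracy against running time.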