Graduate Student: 湯儀君 Tang, Yi-Jun
Thesis Title: 以生成對抗網路透過逐步替換實現之資料合成方法 / A Data Synthesis Approach through Stepwise GAN-based Substitution
Advisor: 紀博文 Chi, Po-Wen
Oral Examination Committee: 紀博文 Chi, Po-Wen; 王銘宏 Wang, Ming-Hung; 曾一凡 Tseng, Yi-Fan; 官振傑 Guan, Albert
Defense Date: 2022/08/08
Degree: Master (碩士)
Department: 資訊工程學系 Department of Computer Science and Information Engineering
Publication Year: 2023
Academic Year of Graduation: 111 (ROC calendar)
Language: English
Pages: 59
Chinese Keywords: 資料匿名化、資料合成
English Keywords: Data Anonymization, Data Synthesis
Research Method: Experimental Design
DOI URL: http://doi.org/10.6345/NTNU202301086
Thesis Type: Academic Thesis
Chinese Abstract (translated):

Today, machine learning techniques are applied in a wide range of fields, such as customer preference analysis, biometric recognition, and even healthcare, and these applications undoubtedly help people make more accurate decisions. However, while datasets containing personal information have grown ever larger and more detailed, privacy-protection techniques and regulations have not advanced at a comparable pace.

Among the well-known data anonymization approaches, k-anonymity is the most prominent; however, it is limited to lower-dimensional datasets and is vulnerable to background-knowledge attacks. Differential privacy, a technique that has attracted much attention over the past decade, provides a strong mathematical privacy guarantee that no record in a dataset can be re-identified, but in practical applications it struggles to balance privacy protection against data utility.

The Greek philosopher Plutarch used the famous legend of the Ship of Theseus to pose a profound question: after the ship's planks have been swapped out one by one, is the Ship of Theseus still the original ship? Taking the conclusion that the English political philosopher Thomas Hobbes drew from this question as our inspiration, we propose the Theseus Data Synthesis Approach. By repeatedly replacing portions of the records in a dataset until every record has been replaced, we produce a synthetic dataset that resembles the original while guaranteeing that no record in the synthetic dataset comes from the original dataset, thereby eliminating the possibility of re-identification. This thesis also proposes models for arguing the security of the mechanism and the similarity of its output.

Data synthesis techniques can be used to create missing data or to enlarge a dataset, and in recent years they have also been applied to medical-data research, where privacy protection is especially important. Building on previous work, this thesis uses generative adversarial networks (GANs) to produce the synthetic dataset, relying on the inherent randomness of the GAN instead of the noise that related work adds to the generator's loss function, in order to improve similarity to the original dataset.

Finally, this thesis analyzes the similarity between the synthetic and original datasets and the utility of the synthetic dataset, examining how different replacement proportions and related work differ in generation quality. We find that synthetic datasets produced with smaller replacement proportions are more similar to the original dataset, outperform related work, and also provide better predictive quality.
English Abstract:

Machine learning algorithms are used for a wide variety of tasks, such as customer preference prediction, facial and voice recognition, and even medical diagnosis, helping us make more accurate decisions. However, as the personal information held in electronic datasets grows more massive and detailed, data privacy protection has not kept pace with rapidly improving data curation techniques.
Previously developed data anonymization methods such as k-anonymity suffer from background-knowledge attacks and are limited to low-dimensional, centralized input datasets. Differential privacy provides a strong mathematical guarantee for algorithms that analyze datasets, preventing any individual record in the dataset from being identified. Despite its rigorous model ensuring the unidentifiability of data providers, however, differential privacy faces serious obstacles in real-world applications because of the trade-off between data quality and privacy.
The Greek philosopher Plutarch asked whether the Ship of Theseus would remain the same ship if it were entirely replaced, piece by piece. Among the many discussions of Plutarch's question, we draw inspiration from the conclusion of the English philosopher Thomas Hobbes to build our own mechanism, the Theseus Data Synthesis Approach (TDSA). We generate synthetic data by repeatedly replacing partial records until no record from the original dataset remains, which prevents data in the original dataset from being re-identified from the released synthetic dataset. Furthermore, we propose a similarity and security scheme for our replacement mechanism.
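To make the replacement loop concrete, here is a minimal Python sketch of the stepwise-substitution idea, assuming a caller-supplied `generate(n)` function that draws n records from an already-trained generator; the function name `theseus_substitution`, the proportion parameter `p`, and the bookkeeping are illustrative choices, not the thesis's exact algorithm.

```python
import numpy as np

def theseus_substitution(original: np.ndarray, generate, p: float = 0.1,
                         rng=None) -> np.ndarray:
    """Each round, replace a fraction p of the rows that still come from
    the original dataset with synthetic rows, until none remain."""
    rng = rng or np.random.default_rng(0)
    data = original.copy()
    is_original = np.ones(len(data), dtype=bool)  # which rows are still original
    step = max(1, int(p * len(data)))
    while is_original.any():
        remaining = np.flatnonzero(is_original)
        chosen = rng.choice(remaining, size=min(step, len(remaining)),
                            replace=False)
        data[chosen] = generate(len(chosen))  # swap in synthetic rows
        is_original[chosen] = False
    return data  # no row originates from the input dataset

# Toy usage with a stand-in "generator" that samples Gaussian noise.
g = np.random.default_rng(1)
orig = g.normal(size=(100, 4))
synth = theseus_substitution(orig, lambda n: g.normal(size=(n, 4)))
```

A smaller `p` means more, finer-grained replacement rounds, which is the knob the evaluation below varies.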
Data synthesis can be used to construct missing values in a dataset or for data augmentation, and in recent years it has also been applied to relatively sensitive medical datasets.
Following previous research, we generate our synthetic data within a GAN framework, but we rely on the inherent randomness of the GAN itself rather than adding noise through an additional term in the generator's loss function, as other works do, in order to preserve the similarity of the generated synthetic data to the original dataset.
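To illustrate where that randomness comes from: in a standard GAN, the generator's only source of stochasticity is the latent vector z it consumes, so fresh synthetic records can be drawn simply by resampling z, with no noise term added to the loss. A minimal PyTorch sketch, with layer widths and dimensions chosen arbitrarily for illustration:

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, latent_dim: int = 32, data_dim: int = 8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, data_dim),
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.net(z)

G = Generator()
# Resampling the latent noise is the only source of randomness;
# nothing is perturbed in the generator's loss function.
z = torch.randn(16, 32)
synthetic_rows = G(z)  # 16 synthetic records, 8 attributes each
```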
We analyze the quality and utility of the synthetic datasets produced under different settings of our proposed mechanism and compare them with related work. We conclude that with a small replacement proportion we can derive synthetic data of higher quality.
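The abstract does not spell out which similarity metrics the thesis uses, so as a purely hypothetical illustration of how such an analysis could be run, the following sketch scores each numeric column with a two-sample Kolmogorov-Smirnov statistic, where smaller values mean the synthetic marginal distribution is closer to the original:

```python
import numpy as np
from scipy.stats import ks_2samp

def columnwise_ks(original: np.ndarray, synthetic: np.ndarray) -> list[float]:
    """Two-sample KS statistic per column; smaller = more similar marginals."""
    return [ks_2samp(original[:, j], synthetic[:, j]).statistic
            for j in range(original.shape[1])]

rng = np.random.default_rng(0)
orig = rng.normal(size=(500, 3))                  # toy "original" data
synth = rng.normal(loc=0.05, size=(500, 3))       # toy "synthetic" data
print(columnwise_ks(orig, synth))
```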