The development of a data lakehouse system for the integration and management of cyber threat intelligence data in XYZ unit
Abstract
Cybersecurity systems are evolving to deal with increasingly complex digital threats. One of the main challenges in this field is integrating and managing Cyber Threat Intelligence (CTI) efficiently. This research designs and implements a data lakehouse to manage CTI data in XYZ Unit. The system was built with Apache Spark, MinIO, Dremio, Nessie, and Apache Iceberg, and is containerized with Docker for flexibility and ease of deployment. In this architecture, MinIO serves as the primary storage, Apache Spark processes data at scale, Dremio enables real-time data analysis, and Nessie provides data version control to maintain data integrity. The implementation results show that the system integrates various CTI data sources and improves efficiency in data storage, processing, and analysis. Black-box testing indicates that the system functions as intended, with results showing improved data integration and more efficient management of cyber threat information. The developed data lakehouse can therefore be an effective solution for supporting threat detection and strategic decision-making in XYZ Unit.
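To make the division of roles among these components concrete, the following is a minimal sketch of the Spark settings that typically connect Apache Spark to an Iceberg catalog managed by Nessie, with table files stored in MinIO. It is not the authors' configuration: the helper name `lakehouse_spark_conf`, the catalog name `nessie`, the host names, the port numbers, and the bucket `cti-warehouse` are all assumptions chosen for illustration.

```python
def lakehouse_spark_conf(nessie_uri, minio_endpoint, warehouse,
                         ref="main", catalog="nessie"):
    """Build Spark properties for an Iceberg catalog backed by Nessie and MinIO.

    Hypothetical helper: every argument value used below is an assumption,
    not a setting taken from the article.
    """
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        # Enable Iceberg's SQL extensions in Spark.
        "spark.sql.extensions":
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        # Register the catalog and delegate versioned metadata to Nessie.
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.catalog-impl": "org.apache.iceberg.nessie.NessieCatalog",
        f"{prefix}.uri": nessie_uri,
        f"{prefix}.ref": ref,  # Nessie branch to read from and write to
        # Keep table data and metadata files in the MinIO bucket.
        f"{prefix}.warehouse": warehouse,
        f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        f"{prefix}.s3.endpoint": minio_endpoint,
        # MinIO is usually addressed with path-style URLs.
        f"{prefix}.s3.path-style-access": "true",
    }


# Assumed Docker service names and ports, for illustration only.
conf = lakehouse_spark_conf(
    nessie_uri="http://nessie:19120/api/v1",
    minio_endpoint="http://minio:9000",
    warehouse="s3://cti-warehouse/",
)

for key, value in sorted(conf.items()):
    print(f"{key}={value}")
```

In a real deployment these pairs would be passed to the Spark session builder one by one, and the Iceberg and Nessie runtime packages plus the MinIO credentials would also have to be supplied; those parts are omitted here because the article does not specify them.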

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.