Distributed Machine Learning using HDFS and Apache Spark for Big Data Challenges

Open Access

Issue		E3S Web of Conf. Volume 465, 2023 8th International Conference on Industrial, Mechanical, Electrical and Chemical Engineering (ICIMECE 2023)


Article Number		02058
Number of page(s)		11
Section		Symposium on Electrical, Information Technology, and Industrial Engineering
DOI		https://doi.org/10.1051/e3sconf/202346502058
Published online		18 December 2023

