The goal of this tutorial is to familiarize attendees with large-scale data processing and analysis using a tool such as Apache Spark. We will introduce the key concepts of MapReduce processing, cover the use of Spark's two main APIs (RDDs and DataFrames) and its MLlib library, and discuss how to design solutions for deploying machine learning models in this context. As case studies, we will show examples of how to deploy large-scale models for massive image segmentation, as well as a solution for ensemble learning. All of this will be presented in a practical way using Python and Jupyter Notebooks.


Table of contents

  • Introduction to massive data analysis (Big Data)
  • Programming paradigms (MapReduce) and tools (Hadoop/Spark/Dask)
  • High-performance programming with Spark RDD and Spark DataFrames
  • Machine Learning with massive data
  • Introduction to Apache Spark's MLlib
  • How to design your own solution: local vs. global approach
  • Case studies: Ensemble Learning in Big Data and Deployment of Deep Learning Models for Image Segmentation
  • Conclusions
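As an informal preview of the MapReduce paradigm covered in the tutorial, the classic word-count example can be sketched in plain Python. This is a minimal, single-machine illustration only: Spark distributes the same map/shuffle/reduce phases across a cluster (via RDD operations such as `flatMap` and `reduceByKey`), and the function names below are ours, not Spark's.

```python
from collections import defaultdict
from functools import reduce

def map_phase(lines):
    """Map: emit one (word, 1) pair per word in the input."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine the grouped values for each key."""
    return {key: reduce(lambda a, b: a + b, values)
            for key, values in groups.items()}

lines = ["big data with spark", "spark and dask", "big models big data"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts["big"])    # 3
print(counts["spark"])  # 2
```

In Spark, the same computation is expressed declaratively and the framework handles partitioning, shuffling, and fault tolerance across the cluster.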

Submission & formatting

The instructions for preparing submissions and the conditions for the different types of contributions are the same as those of the CAEPIA conference, and are available at

Papers for the workshop must be submitted through the CAEPIA EasyChair, available at EasyChair, selecting this workshop's track.

We look forward to your contributions!

Download the CfP