Yusuf's Portfolio

Data Science & Machine Learning Projects



👋 Introduction


Welcome, and thank you for stopping by my data science and machine learning portfolio!

I’m Yusuf, a part-time analytics postgraduate student at Georgia Tech and an aspiring applied data science and machine learning professional. This space brings together projects I’ve worked on through academic research, collaborations with universities, and self-guided learning.

These projects reflect my experience working with real-world data, where I’ve applied data science and machine learning techniques such as statistical analysis, regression modelling, and time series forecasting to uncover meaningful insights. While much of my work has been shaped by research-driven challenges in academic settings, I’ve also taken the initiative to explore ideas independently, building complete workflows from data preparation to model evaluation. The skills I’ve developed are highly transferable and relevant to any field where data plays a central role, and I’m always eager to explore new problem spaces and continue learning through hands-on work.

Thanks again for visiting, and I hope you find something here that interests you. Whether you’re a hiring manager, data team lead, or just curious, I’m glad you’re here!


🔍 What You’ll Find Here


🗂️ Project Index

Click a project to jump to its section:


Rain-Net: Daily Rainfall Forecasting (2025 - ongoing)

🔍 Overview


Rain-Net is an ongoing research collaboration with Sunway University, focused on developing a machine learning framework to forecast daily rainfall. Because the work is still in progress and not yet published, only selected aspects of the project are shared here. Project code and raw data will not be shared, but key methodologies and results are summarised below.

The aim is to build a predictive model that delivers usefully accurate rainfall forecasts even in data-constrained settings, where only historical daily rainfall is available.

Daily rainfall
Rain-Net Figure 1: Daily rainfall data over time, showcasing intermittent but intense rainfall spikes, with several pronounced peaks suggesting periods of extreme weather events. Most days experienced little to no rainfall, highlighting the sparse yet heavy nature of the rainfall distribution. (x and y axes are removed due to confidentiality requirements)

📊 Data & Features

Given the dataset’s limitations, heavy emphasis was placed on feature engineering to enrich the information available to the models. The following features were engineered from the original univariate dataset (consisting only of historical daily rainfall):
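As a rough illustration of this kind of feature engineering, the sketch below derives lagged values, a previous-day rain indicator, a moving average, and a cyclical day-of-year encoding from a univariate daily series. The function name `build_features`, the feature names, and the default window sizes are assumptions made for illustration; they are not the project's actual code or feature set.

```python
import math
from datetime import date, timedelta


def build_features(dates, rainfall, n_lags=7, window=14):
    """Build one feature row per day that has enough history behind it.

    Hypothetical helper: names and defaults are illustrative assumptions.
    """
    rows = []
    start = max(n_lags, window)
    for i in range(start, len(rainfall)):
        row = {"date": dates[i], "target": rainfall[i]}
        # Lagged rainfall from each of the previous n_lags days.
        for k in range(1, n_lags + 1):
            row[f"lag_{k}"] = rainfall[i - k]
        # Binary indicator: did it rain on the previous day?
        row["rained_prev_day"] = 1 if rainfall[i - 1] > 0 else 0
        # Moving average over the previous `window` days.
        # The current day is excluded so the target never leaks into its own features.
        row[f"ma_{window}"] = sum(rainfall[i - window:i]) / window
        # Cyclical encoding so late December and early January sit close together.
        angle = 2 * math.pi * dates[i].timetuple().tm_yday / 365.25
        row["doy_sin"] = math.sin(angle)
        row["doy_cos"] = math.cos(angle)
        rows.append(row)
    return rows
```

Every feature is computed only from days before the target day, which keeps the setup honest for forecasting.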


🧪 Exploratory Data Analysis (EDA)

A detailed EDA was conducted to understand the dataset’s structure and behaviour:

Histogram of daily rainfall
Rain-Net Figure 2: Histogram of daily rainfall showing a strong right skew, with most days experiencing low or no rainfall and fewer days with high rainfall amounts. This highlights the typical pattern of rainfall events being infrequent but occasionally intense. (x and y axes are removed due to confidentiality requirements)
Boxplot of daily rainfall
Rain-Net Figure 3: Boxplot showing most data points are tightly clustered near the lower end of the scale, with a long tail and many outliers indicating extreme rainfall events.
Violinplot of daily rainfall
Rain-Net Figure 4: The violin plot shows a sharp peak near 0 mm, reflecting the frequency of dry or light rainfall days, with a long, thin tail extending towards high rainfall values.
ACF and PACF of daily rainfall
Rain-Net Figure 5: ACF and PACF plots show significant short-term autocorrelation, supporting the use of up to 7 lag days as predictive features.
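For readers unfamiliar with how an ACF plot supports a choice of lags, here is a minimal sketch of the sample autocorrelation and the usual approximate 95% significance band (plus or minus 1.96 divided by the square root of the series length, which assumes white noise). It covers the ACF only, not the PACF, and the function names are hypothetical. It is not the analysis code used in the project.

```python
import math


def autocorrelation(series, max_lag):
    """Sample autocorrelation for lags 0..max_lag (biased estimator)."""
    n = len(series)
    mean = sum(series) / n
    deviations = [x - mean for x in series]
    denom = sum(d * d for d in deviations)
    if denom == 0:
        raise ValueError("autocorrelation is undefined for a constant series")
    acf = []
    for k in range(max_lag + 1):
        num = sum(deviations[t] * deviations[t - k] for t in range(k, n))
        acf.append(num / denom)
    return acf


def significant_lags(series, max_lag):
    """Lags whose autocorrelation falls outside the approximate 95% band."""
    bound = 1.96 / math.sqrt(len(series))
    acf = autocorrelation(series, max_lag)
    return [k for k in range(1, max_lag + 1) if abs(acf[k]) > bound]
```

Lags returned by `significant_lags` are natural candidates for lagged input features.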

🧠 Methods & Models

Catboost SHAP readings
Rain-Net Figure 6: SHAP summary plot for the CatBoost model showing that the previous-day rainfall indicator, cyclical features, and short-term temporal features (e.g. previous-day rainfall, 14-day moving average) have the highest influence on model predictions, while longer-term or variability-based features have lower impact.

📈 Results & Evaluation

While the overall performance still leaves room for improvement, CatBoost currently stands out as the best-performing model among those tested. It achieves lower error metrics than the other models across the training, validation, and test sets, and handles the general rainfall patterns better. Its Nash-Sutcliffe Efficiency (NSE) on the test set is low, which is expected given the data limitations and high variability, but CatBoost remains the most promising model in this study so far.
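For reference, the Nash-Sutcliffe Efficiency compares the model's squared error against that of a baseline that always predicts the observed mean: a value of 1 is a perfect fit, 0 means no better than the mean, and negative values mean worse than the mean. The sketch below implements the standard formula; the function name is hypothetical and this is not the project's evaluation code.

```python
def nash_sutcliffe_efficiency(observed, predicted):
    """NSE = 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("NSE is undefined when observations are constant")
    return 1 - ss_res / ss_tot
```

Because the errors are squared, a few underpredicted rainfall peaks can pull NSE down sharply even when dry days are forecast well.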

Catboost train results
Rain-Net Figure 7: In the training set, CatBoost closely matches actual rainfall values across a wide range of conditions. High rainfall events are well captured, suggesting strong model fit. However, care should be taken when evaluating performance on unseen data to ensure generalisability. (x and y axes are removed due to confidentiality requirements)
Catboost val results
Rain-Net Figure 8: On the validation set, the CatBoost model successfully follows the general pattern of rainfall, especially during low to moderate rainfall days. It captures the timing of the major spikes in the actual data but often underestimates their magnitude. (x and y axes are removed due to confidentiality requirements)
Catboost test results
Rain-Net Figure 9: CatBoost rainfall forecasting on the test set shows strong alignment between predicted and actual values during dry periods, with reasonable tracking of rainfall trends overall. Peak rainfall events tend to be underpredicted, highlighting the model's difficulty in capturing extremes. (x and y axes are removed due to confidentiality requirements)

🛠️ Tools & Libraries


💡 Key Takeaways


🔄 Ongoing Work


VersusAI: Monte Carlo Tree Search Variant Performance Prediction (2024)

⚙️ This project is in progress! I’m piecing it together and digging through my old work like a data archaeologist. Just need a bit more time to get everything organised and properly displayed here. In the meantime, feel free to explore my work on the Rain-Net and FlowTrack projects!

🔍 Overview

Coming soon…

📊 Data & Features

Coming soon…

🧪 Exploratory Data Analysis (EDA)

Coming soon…

🧠 Methods & Models

Coming soon…

📈 Results & Evaluation

Coming soon…

🛠️ Tools & Libraries

Coming soon…

💡 Key Takeaways

Coming soon…


FlowTrack: River Streamflow Forecasting (2022)

🔍 Overview


FlowTrack is a completed machine learning research project conducted under a collaboration between Universiti Tunku Abdul Rahman and Universiti Tenaga Nasional, focused on forecasting daily river streamflow across multiple rivers in Peninsular Malaysia. The study addresses a critical research gap: whether a single model can generalise effectively across diverse river systems.

The overall findings have been published in Scientific Reports by Nature and can be accessed here. Due to confidentiality requirements, project code and raw data will not be shared, but key methodologies and results are summarised below.

FlowTrack Table 1: Geographical and administrative information of river stations used in this study, including coordinates, station IDs, and data coverage period.
River info

📊 Data & Features

The study focuses on univariate time series forecasting, where only past streamflow values were available as inputs. Based on statistical analysis and autocorrelation studies, lagged streamflow values were selected as predictors.

At the time of this research, the focus was kept on a minimal univariate setup using lagged features. More advanced feature engineering, such as seasonality indicators, rolling statistics, or rainfall as an additional input, might have improved model performance, but my understanding of these techniques was still developing. Looking back, incorporating them could have offered better temporal context and generalisation, especially for rivers with highly variable flow. This is something I’ve started exploring and implementing in more recent projects.


🧪 Exploratory Data Analysis (EDA)

A thorough EDA was conducted to understand streamflow behaviour and inform model design:

FlowTrack Table 2: Summary statistics for each river’s streamflow data, highlighting variability in mean, spread, and extreme values across river systems.
Summary statistics
River Pacf
FlowTrack Figure 1: Partial autocorrelation plots for all rivers indicate strong short-term dependencies, particularly at lag-1, lag-2, and lag-3, supporting the chosen input scenarios.
FlowTrack Table 3: Pearson correlation coefficients showing strong relationships between current and lagged streamflow values across selected rivers, justifying the use of lags as input features.
Pearson correlation
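To make the lag-correlation idea concrete, the following sketch computes the Pearson correlation between a series and a copy of itself shifted by a given number of days. The function names are hypothetical and the code is an illustration of the standard formula, not the code behind the table.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sd_x * sd_y)


def lag_correlation(series, lag):
    """Correlation between the flow at day t and the flow at day t - lag (lag >= 1)."""
    return pearson(series[lag:], series[:-lag])
```

A coefficient close to 1 at a given lag suggests that the flow from that many days earlier carries useful information about today's flow.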
Sample imputation
FlowTrack Figure 2: Example of missing value imputation using linear interpolation for Sungai Johor. Red lines indicate imputed values where known streamflow data (blue) was unavailable.
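A minimal sketch of linear interpolation over gaps in a daily series is shown below. It assumes evenly spaced daily observations with missing values marked as `None`, and it leaves leading or trailing gaps untouched because there is no second endpoint to interpolate towards. The function name is hypothetical; the study's own imputation code is not shown here.

```python
def interpolate_linear(values):
    """Fill interior None gaps on a straight line between the nearest known neighbours."""
    filled = list(values)
    last_known = None  # index of the most recent non-missing value
    for i, value in enumerate(values):
        if value is None:
            continue
        if last_known is not None and i - last_known > 1:
            start = values[last_known]
            span = i - last_known
            for j in range(last_known + 1, i):
                filled[j] = start + (value - start) * (j - last_known) / span
        last_known = i
    return filled
```

For example, `interpolate_linear([10.0, None, None, 16.0])` returns `[10.0, 12.0, 14.0, 16.0]`.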

🧠 Methods & Models

FlowTrack Table 4: Partitioning of river streamflow datasets into training and testing periods for each river. The split dates vary by river and reflect long-term daily data availability, but the ratio is the same across all rivers (80% training and 20% testing).
Data partitioning
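For time series, an 80/20 partition is normally made chronologically rather than by random shuffling, so that the test period lies strictly after the training period. The sketch below shows one way to do this, assuming the series is already sorted by date; the function name and the rounding-down behaviour are illustrative assumptions.

```python
def chronological_split(series, train_fraction=0.8):
    """Split an ordered series into an earlier training part and a later test part."""
    if not 0 < train_fraction < 1:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    cut = int(len(series) * train_fraction)  # rounds down
    return series[:cut], series[cut:]
```

Keeping the order intact prevents future observations from leaking into training.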

📈 Results & Evaluation

Among the 99 models tested across 11 river datasets, the ANN3 model (Artificial Neural Network with Scenario 3 input lags) emerged as the best overall performer. It ranked first in 4 of the 11 rivers (Sungai Johor, Sungai Pahang, Sungai Arau, Sungai Selangor) and achieved the second-best average RM score (3.27), marginally behind ANN2 (3.21). Despite the slightly weaker average, ANN3 was selected as the best overall model because it produced top-performing forecasts across more rivers.

FlowTrack Table 5: RM scores for all models across each river dataset. ANN3 achieves the best RM in 4 datasets, with ANN models dominating overall rankings.
RM table
RM bar chart
FlowTrack Figure 3: Average RM scores for each model and scenario, highlighting ANN2 and ANN3 as top-performing models across river datasets.
Sg Johor graph
FlowTrack Figure 4: Forecast comparison on Sungai Johor test set. ANN3 captures sharp spikes more effectively than SVM and LSTM.
FlowTrack Table 6: Performance of all models on Sungai Johor. ANN3 achieves top scores across all evaluation metrics.
Sg Johor table results
Sg Pahang graph
FlowTrack Figure 5: Actual vs predicted streamflow for Sungai Pahang using the best models from each algorithm. ANN3 closely follows actual peaks and troughs.
FlowTrack Table 7: Evaluation results for Sungai Pahang dataset. ANN3 again demonstrates leading performance across MAE, RMSE, and R².
Sg Pahang table results
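The three metrics named in these tables can be sketched as follows. R² is written here in its coefficient-of-determination form (1 minus residual sum of squares over total sum of squares); whether the study used this form or the squared Pearson correlation is an assumption on my part. Function names are hypothetical.

```python
import math


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error; penalises large misses more heavily than MAE."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot (assumed definition)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when actual values are constant")
    return 1 - ss_res / ss_tot
```

MAE and RMSE are in the same units as the streamflow, while R² is unitless, so R² is the easier one to compare across rivers with very different flow magnitudes.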

🛠️ Tools & Libraries


💡 Key Takeaways


🚧 Room for Improvement


SediSense: Suspended Sediment Load Forecasting (2022)

⚙️ This project is in progress! I’m piecing it together and digging through my old work like a data archaeologist. Just need a bit more time to get everything organised and properly displayed here. In the meantime, feel free to explore my work on the Rain-Net and FlowTrack projects!

🔍 Overview

Coming soon…

📊 Data & Features

Coming soon…

🧪 Exploratory Data Analysis (EDA)

Coming soon…

🧠 Methods & Models

Coming soon…

📈 Results & Evaluation

Coming soon…

🛠️ Tools & Libraries

Coming soon…

💡 Key Takeaways

Coming soon…

SolarCast: Photovoltaic Solar Power Prediction (2022)

⚙️ This project is in progress! I’m piecing it together and digging through my old work like a data archaeologist. Just need a bit more time to get everything organised and properly displayed here. In the meantime, feel free to explore my work on the Rain-Net and FlowTrack projects!

🔍 Overview

Coming soon…

📊 Data & Features

Coming soon…

🧪 Exploratory Data Analysis (EDA)

Coming soon…

🧠 Methods & Models

Coming soon…

📈 Results & Evaluation

Coming soon…

🛠️ Tools & Libraries

Coming soon…

💡 Key Takeaways

Coming soon…