
Winning Solution for the Mtoto News Childline Kenya Call Volume Prediction Challenge

The objective of this Zindi Africa competition was to build a forecasting model that predicts the number of incoming calls Childline Kenya will receive per hour per day. Such a forecast would help Childline Kenya manage its resources, staff, and call center more efficiently, enabling it to serve as many children as possible.

Approach

Since the observations were recorded sequentially over time, I treated the challenge as a time series problem and used time series features. First, I plotted the series and checked it for stationarity: a constant mean, a constant variance, and an autocovariance that does not depend on time. I also considered the components of a time series, such as trend and seasonality.
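A quick way to see whether the mean and variance stay constant is to compute them over consecutive windows and compare. The sketch below is only an illustration on synthetic numbers, not the competition data; the `window_stats` helper and the 24-hour window are assumptions made for this example.

```python
import random
import statistics


def window_stats(series, window):
    """Return a (mean, variance) pair for each consecutive, non-overlapping window."""
    stats = []
    for start in range(0, len(series) - window + 1, window):
        chunk = series[start:start + window]
        stats.append((statistics.mean(chunk), statistics.pvariance(chunk)))
    return stats


# Synthetic hourly call counts (NOT the competition data):
# two flat days followed by two days with an upward trend.
random.seed(0)
flat = [20 + random.gauss(0, 2) for _ in range(48)]
trending = [20 + 0.5 * t + random.gauss(0, 2) for t in range(48)]
calls = flat + trending

daily_stats = window_stats(calls, window=24)
for day, (mean, variance) in enumerate(daily_stats, start=1):
    print(f"day {day}: mean={mean:.1f}, variance={variance:.1f}")
```

A mean that drifts from window to window, as in the trending part of this toy series, suggests the series is not stationary. A formal test, such as the augmented Dickey-Fuller test (`adfuller` in statsmodels), could complement this rough check.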

Data transformations

After plotting the series, I began by dropping values that lay above a chosen standard-deviation threshold, thinking they were outliers. I later realized that most of these values occurred in July, when the series had just started a new upward trend. Dropping them hurt my score (an RMSE of about 30 on the public leaderboard).

The next thought was to find a way to reduce the trend. I applied a log transformation to dampen these higher values so that I could use statistical forecasting techniques on the series. Although this transformation didn’t improve the score with XGBoost, it worked well with the ARIMA model.

I then looked for a way to work with the series without eliminating the higher values, and settled on data splitting. The approach involved performing a side-by-side split on the higher values, which occurred in July, dividing them into two parts. I then set my training dataset to consist of all values from January and one split from July, and tested the model on the remaining split.
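The log transformation and the July split can be sketched as follows. This is a minimal illustration on synthetic records, not the author's code. It assumes that the split is chronological (first half of July for training, second half for testing) and that `log1p` is an acceptable stand-in for the log transformation. The helper names and the year are hypothetical.

```python
import math
from datetime import datetime, timedelta


def log_transform(counts):
    """Compress large values; log1p also handles zero counts safely."""
    return [math.log1p(c) for c in counts]


def inverse_log_transform(values):
    """Map log-scale values back to call counts."""
    return [math.expm1(v) for v in values]


def split_high_month(records, month=7):
    """Train on everything before `month` plus the first half of `month`;
    hold out the second half of `month` for testing."""
    before = [r for r in records if r[0].month < month]
    high = [r for r in records if r[0].month == month]
    mid = len(high) // 2
    return before + high[:mid], high[mid:]


# Synthetic hourly records (NOT the competition data); the year is an arbitrary placeholder.
records = []
t = datetime(2019, 1, 1)
while t < datetime(2019, 8, 1):
    records.append((t, 10 if t.month < 7 else 40))  # July sits at a higher level
    t += timedelta(hours=1)

train, test = split_high_month(records)
logged = log_transform([count for _, count in train])
restored = inverse_log_transform(logged)
print(len(train), len(test))
```

Keeping part of July in the training set lets the model see the new, higher level before it is evaluated on the rest of the month. Forecasts made on the log scale need to be mapped back with the inverse transform before they are scored.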

Data modeling

The main models were ARIMA, Random Forest, and XGBoost. Of these, XGBoost remained robust throughout the competition! This is probably because it allows cross-validation at each iteration of the boosting process. (I fitted XGBoost on the data without any transformation.)
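Tree-based models such as Random Forest and XGBoost cannot use a raw timestamp directly, so it is usually expanded into numeric columns first. The sketch below shows one possible set of time features. The `time_features` helper and the choice of features are assumptions for illustration, not necessarily the ones used in the winning solution.

```python
from datetime import datetime


def time_features(ts):
    """Turn a timestamp into numeric features a tree-based model can split on."""
    return {
        "hour": ts.hour,
        "day_of_week": ts.weekday(),           # Monday=0 ... Sunday=6
        "is_weekend": int(ts.weekday() >= 5),  # 1 for Saturday or Sunday
        "day_of_month": ts.day,
        "month": ts.month,
    }


# Hypothetical timestamp, used only to show the output shape.
example = time_features(datetime(2019, 7, 6, 14))
print(example)
```

Each hourly record would get one such row, with the call count for that hour as the target.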

Performance Evaluation

I used the mean absolute percentage error (MAPE) to evaluate my forecast accuracy. I chose MAPE because it is scale-independent and easy to interpret, and because, given the trend observed earlier, most values were not zero or close to zero.
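For reference, MAPE averages the absolute error of each forecast relative to the actual value: MAPE = (100 / n) * sum(|actual - forecast| / |actual|). A minimal sketch follows, using made-up numbers rather than competition results. The explicit zero check is an addition for this example, since the metric is undefined when an actual value is zero.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("inputs must not be empty")
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    total = sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


# Made-up hourly call counts, not competition results.
actual_calls = [100, 200, 50]
forecast_calls = [110, 180, 50]
score = mape(actual_calls, forecast_calls)
print(f"MAPE: {score:.2f}%")
```

Because each error is divided by the actual value, hours with very few calls can dominate the score, which is why the metric only suits series whose values stay away from zero.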

Conclusion

The performance of the model could be improved by providing more data for training. We only had six months of training data, which means we couldn’t capture important time series components such as yearly seasonality and cyclic variation. More data would also help us make use of the holiday data and school times data provided on Zindi. I also didn’t use the weather data provided, because features that are not time related need to be readily available at the moment predictions are made. For example, when you try to predict a value two weeks in advance, some information, such as temperature, may not yet be available.

The code is available on GitHub through the button below.

About Author

Lawrence Moruye is a fourth-year student at Multimedia University of Kenya, pursuing a bachelor’s degree in mathematics and computer science, with an interest in AI and data science. He uses machine learning to solve challenges in finance, agriculture, and fraud detection. He is looking to secure an internship in data science or AI from April 2020.