This week, I started working on the Competition Analysis System (CAS). Before actually building CAS, though, I needed to add new functionality to the existing Competition Management System (CMS) that allows the admin to export useful data from the database in the form of tsv. The exported data will then be transformed, restructured and finally imported into CAS for analysis.
To add the export functionality, I defined a new class, AdminExportTSVResource, which derives from the Wt::WResource abstract base class. I also defined another class, CMSTSVExporter, to handle writing the data extracted from the database into an output string stream. To implement these two classes, I needed to use existing query functions defined in the CMSDatabase class. As I was studying the code in CMSDatabase.cpp and CMSDatabase.h, I was rather shocked by the sheer volume of the implementation (there are close to 1000 lines of code in CMSDatabase.cpp!). Some member functions of the CMSDatabase class feel somewhat repetitive. For instance, there are several functions like Wt::Dbo::ptr<Judgings> readJudging(int id) and Wt::Dbo::ptr<Players> readPlayers(int id), which behave very similarly, except that they query different tables in the database. I couldn’t help but ask: could this have been implemented differently, maybe using virtual functions, or by employing some design patterns? These findings are quite valuable for me, because they prompted me to start thinking about how I would design the classes in CAS that interface with the database.
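One possible answer to that question is a function template parameterised on the record type. The sketch below is only an illustration under loud assumptions: Judging, Player, MockDatabase and readById are hypothetical stand-ins backed by in-memory maps, not the real CMS classes or Wt::Dbo calls.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <tuple>

// Hypothetical stand-ins for the CMS record types (not the real Wt::Dbo classes).
struct Judging { int id; double score; };
struct Player  { int id; std::string name; };

// Hypothetical in-memory database holding one id -> record table per record type.
template <typename... Records>
class MockDatabase {
public:
    // Store a record in the table that matches its type.
    template <typename Record>
    void insert(const Record& record) {
        std::get<std::map<int, Record>>(tables_)[record.id] = record;
    }

    // One generic lookup in place of readJudging(int), readPlayers(int), ...
    // Returns nullptr when no record with this id exists.
    template <typename Record>
    const Record* readById(int id) const {
        const auto& table = std::get<std::map<int, Record>>(tables_);
        const auto it = table.find(id);
        return it == table.end() ? nullptr : &it->second;
    }

private:
    std::tuple<std::map<int, Records>...> tables_;
};
```

With a MockDatabase<Judging, Player> db, a call such as db.readById<Player>(7) picks the table from the type argument, so adding a new record type would not require a new member function.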
In the end, I managed to implement the two new classes and export the data from CMS into a tsv file using cURL with the -o option (hi future interns, check this out if you’re a newbie to cURL like me 🙂 ). The next step is to figure out how this data should be stored in the CAS database for convenient access during analysis. This involves answering the question “What kind of algorithms will be used in CAS?”.
I find it difficult to give a specific answer to this question without actually experimenting on sample data. Fortunately, Dr Shawn gave me a sample dataset (obtained from this year’s competition in Johor Bahru). I ran some quick analyses in R and arrived at the following conclusions.
- In this project, model inference is far more important than the predictive power of the model. For instance, questions such as “Are judge scores biased towards a certain instrument, or type of event?” and “Does the duration of performance have an impact on the score awarded?” are far more interesting than “Given the duration of performance, instrument used, and type of event, what score is likely to be awarded?”. For this reason, simple models like linear regression and logistic regression should be preferred over complex machine learning models (with very limited model inference capabilities) such as GBDTs, neural networks, SVMs, etc.
- To be concrete, suppose we fitted two models. Model 1 is a linear model of the form
score = β1 + β2*duration + β3*1{instrument == "guitar"}
(assuming only two types of instruments exist, for simplicity). Model 2 is a gradient boosted decision tree with instrument and duration as the inputs, and score as the output. Both models can provide predictions of score using instrument and duration as predictors. Since Model 2 has higher predictive power, it is expected to provide much more accurate predictions than Model 1. However, it is quite difficult (but not impossible) to interpret the parameters of Model 2 if we want to know which of instrument and duration is the more ‘significant’ predictor. In Model 1, although the predictions are not as accurate as those of Model 2, we can easily perform hypothesis tests on β1, β2 and β3 to find out if any of them are statistically significant (judging from p-values). Say, for instance, β3 is negative and statistically significant; we can then make a claim such as “On average, participants at the competition who performed with a guitar are awarded |β3| fewer points than those who performed with other instruments, for a performance of the same duration”.
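To show what inference on Model 1 involves computationally, here is a minimal sketch of ordinary least squares in plain C++ that returns the coefficients together with their standard errors and t-statistics. The names Matrix, OlsResult, invert and fitOls are my own for illustration; it solves the normal equations directly, which is fine for a handful of predictors but is not how a production library would do it. Each row of X is assumed to be {1, duration, 1{instrument == "guitar"}}.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <utility>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Hypothetical result type: one entry per coefficient (β1, β2, β3, ...).
struct OlsResult {
    std::vector<double> beta;      // estimated coefficients
    std::vector<double> stdError;  // standard error of each coefficient
    std::vector<double> tStat;     // beta / stdError
};

// Invert a small square matrix by Gauss-Jordan elimination with partial pivoting.
Matrix invert(Matrix a) {
    const std::size_t n = a.size();
    Matrix inv(n, std::vector<double>(n, 0.0));
    for (std::size_t i = 0; i < n; ++i) inv[i][i] = 1.0;
    for (std::size_t col = 0; col < n; ++col) {
        std::size_t pivot = col;
        for (std::size_t r = col + 1; r < n; ++r)
            if (std::fabs(a[r][col]) > std::fabs(a[pivot][col])) pivot = r;
        if (std::fabs(a[pivot][col]) < 1e-12)
            throw std::runtime_error("matrix is singular (collinear predictors?)");
        std::swap(a[pivot], a[col]);
        std::swap(inv[pivot], inv[col]);
        const double d = a[col][col];
        for (std::size_t j = 0; j < n; ++j) { a[col][j] /= d; inv[col][j] /= d; }
        for (std::size_t r = 0; r < n; ++r) {
            if (r == col) continue;
            const double f = a[r][col];
            for (std::size_t j = 0; j < n; ++j) {
                a[r][j] -= f * a[col][j];
                inv[r][j] -= f * inv[col][j];
            }
        }
    }
    return inv;
}

// Fit y = X * beta by least squares; X must already contain the intercept column.
OlsResult fitOls(const Matrix& X, const std::vector<double>& y) {
    if (X.empty() || X.size() != y.size() || X.size() <= X[0].size())
        throw std::invalid_argument("need more observations than coefficients");
    const std::size_t n = X.size(), p = X[0].size();
    Matrix xtx(p, std::vector<double>(p, 0.0));
    std::vector<double> xty(p, 0.0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < p; ++j) {
            xty[j] += X[i][j] * y[i];
            for (std::size_t k = 0; k < p; ++k) xtx[j][k] += X[i][j] * X[i][k];
        }
    const Matrix inv = invert(xtx);  // (X^T X)^-1

    OlsResult res;
    res.beta.assign(p, 0.0);
    for (std::size_t j = 0; j < p; ++j)
        for (std::size_t k = 0; k < p; ++k) res.beta[j] += inv[j][k] * xty[k];

    double rss = 0.0;  // residual sum of squares
    for (std::size_t i = 0; i < n; ++i) {
        double pred = 0.0;
        for (std::size_t j = 0; j < p; ++j) pred += X[i][j] * res.beta[j];
        rss += (y[i] - pred) * (y[i] - pred);
    }
    const double sigma2 = rss / static_cast<double>(n - p);  // residual variance
    for (std::size_t j = 0; j < p; ++j) {
        const double se = std::sqrt(sigma2 * inv[j][j]);
        res.stdError.push_back(se);
        res.tStat.push_back(res.beta[j] / se);
    }
    return res;
}
```

The t-statistic for β3 would then be compared against a t distribution with n - 3 degrees of freedom to obtain the p-value; that last step is left out here because it needs a t-distribution CDF.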
To make the process of fitting linear regression or logistic regression models convenient, a table of the form <predictor1><predictor2>… <predictor n> <score> must be included in CAS’s database.
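As a rough sketch of how such a table could be loaded from the exported tsv, the snippet below parses tab-separated text whose last column is the score and whose other columns are numeric predictors. DesignTable, splitTabs and parseDesignTable are hypothetical names, and the assumption that every predictor is already numeric (for example, instrument encoded as 0/1) is mine, not something the CMS export guarantees.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical in-memory form of the <predictor1> ... <predictor n> <score> table.
struct DesignTable {
    std::vector<std::string> predictorNames;      // header without the score column
    std::vector<std::vector<double>> predictors;  // one row per performance
    std::vector<double> scores;                   // last column of each row
};

// Split one line on tab characters.
std::vector<std::string> splitTabs(const std::string& line) {
    std::vector<std::string> fields;
    std::istringstream in(line);
    std::string field;
    while (std::getline(in, field, '\t')) fields.push_back(field);
    return fields;
}

// Parse tsv text with a header row; assumes all values are numeric.
DesignTable parseDesignTable(const std::string& tsv) {
    std::istringstream in(tsv);
    std::string line;
    if (!std::getline(in, line)) throw std::invalid_argument("empty input");
    const std::vector<std::string> header = splitTabs(line);
    if (header.size() < 2) throw std::invalid_argument("need a predictor and a score");

    DesignTable table;
    table.predictorNames.assign(header.begin(), header.end() - 1);
    while (std::getline(in, line)) {
        if (line.empty()) continue;  // skip blank lines
        const std::vector<std::string> fields = splitTabs(line);
        if (fields.size() != header.size())
            throw std::invalid_argument("row has the wrong number of columns");
        std::vector<double> row;
        for (std::size_t i = 0; i + 1 < fields.size(); ++i)
            row.push_back(std::stod(fields[i]));
        table.predictors.push_back(row);
        table.scores.push_back(std::stod(fields.back()));
    }
    return table;
}
```

Keeping the predictors and the score in separate containers means the rows can be handed to a regression routine without further reshaping.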
Since linear regression models will be playing a key role in CAS, I also did some googling to find out which C++ libraries are going to be most useful.
- mlpack contains implementations of model fitting for linear regression and logistic regression, but model inference does not appear to be implemented.
- armadillo is a highly efficient linear algebra library. In the worst case, I could use it to implement my own linear regression.
Fortunately, both mlpack and armadillo are already installed and available on the office PC.
That’s all for now, I can’t wait to continue working on this project next week 🙂 .