Last week, I’ve finally gotten a good idea of the kind of analysis I want to perform on the data (extracted from CMS) and the statistical models to be used (linear regression and logistic regression). For convenient fitting of such models, we obviously require a table of the form <predictor 1> <predictor 2> …. <predictor n> <score> , where each record (row) represents a single training data for the regression model. This table can be obtained in CMS via a SQL query of the form
SELECT Judges.id, Players.id, Instruments.id, ... Scores.absolute_score FROM Judges, Players, Instruments, Performances ... Scores WHERE Judges.id = Performances.judges_id and Instruments.id = Performances.id and ...
This is a slightly complicated query, and involves joining several tables. Fortunately, the
query() function in
Wt::Dbo::Session is flexible enough to take this entire SQL string as input and output the desired table in the form of a collection of
boost::tuple objects. As I was writing the codes to perform this query in CMS, I ran into an error that I have not seen before:
A collection for '...' is already in use. Reentrant statement use is not yet implemented.
After a bit of googling, it turns out that it was a restriction of the library, and the original introductory tutorial for
Wt::Dbo actually mentioned this. I did not fully understand the explanation given in the tutorial, but by following the advice given there (copying out the results of the query into a
std::vector), I was able to avoid this error later on. It might be worthwhile for me to look into this further to gain a bit more understanding.
After adding the querying function (into CMS) to construct the joined table, I’ve spent the rest of the week implementing the classes that will interface with the CAS Database via
Wt::Dbo. I wanted to make sure that these classes are implemented in a flexible and easily understandable manner because there are likely more tables to be added to the database in the future (to store results of additional analyses, etc). Currently, I have implemented 9 classes that correspond to 9 tables in the database, namely AbsoluteScores (the result of the SQL query mentioned above), Performances, Events, Venues, Instruments, Scores, Players, Categories and Judges. AbsoluteScores consists of 8 foreign keys that points to the rest of the 8 tables. The rest of the tables do not contain any foreign keys, as their purpose are for storing miscellaneous information that are not used in analysis (eg. name, email, phone number, etc).
In the future, if more analysis needs to be incorporated into CAS, a key step is to be able to easily extract existing data from CAS database, and then perform experimentation on the extracted data (probably with a more statistics-friendly language, like R). Thus, ideally the extraction of data from CAS should be implemented in a flexible manner that involves minimal hardcoding. Motivated by this thought, I realize that it might be good to have a mechanism to easily iterate through all existing tables in a database. So, I defined an abstract base class Table which are inherited by all 9 classes (that correspond to the 9 tables in database). Within this abstract base class, I defined several template functions like
addRecord() to perform common database operations, as well as a few pure virtual functions like
mapClass(). There were some cases where I had to spend quite a bit of time contemplating whether I should use a template or virtual function, and I find that “Will this lead to downcasting in the future?” seems to be a very useful question to think about when doing class design.
In the upcoming week, I will first implement the REST API for importing data in the form of TSV into CAS. Then, I can start focusing on the exciting part of this project — analytics!