Last week, I finished implementing the functions for calculating precision and recall values of judge scores. The final set of calculations I need to implement is the regression of judge scores on categorical factors.
- First, I grouped the judge scores by venue, year, and judge_id. After grouping, each group would look something like this:
```
venue_id, year, judge_id, category_id,   instrument_id, performance_id, absolute_score
Penang,   2018, Judge-2,  Under-12,      Violin,        1,              80
Penang,   2018, Judge-2,  Under-12,      Violin,        2,              77
Penang,   2018, Judge-2,  Young-Artiste, Cello,         3,              83
...
```
- For each group, I perform two sets of regressions:
    - regress absolute_score on category_id as a categorical predictor
    - regress absolute_score on instrument_id as a categorical predictor
- The result of each of these regressions can be summarized in output such as the following (from R):
```
Call:
lm(formula = absolute_score ~ instruments_id, data = data.raw.judge2)

Residuals:
     Min       1Q   Median       3Q      Max
-22.4426  -6.4426   0.5574   5.5574  15.5574

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)       77.667      4.659  16.670   <2e-16 ***
instruments_id2   -5.224      4.772  -1.095   0.2779
instruments_id3   -3.667      9.318  -0.393   0.6953
instruments_id4  -16.667      9.318  -1.789   0.0786 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.07 on 62 degrees of freedom
Multiple R-squared:  0.05049,   Adjusted R-squared:  0.00455
F-statistic: 1.099 on 3 and 62 DF,  p-value: 0.3564
```
- The intercept is strongly significant, but this is not useful information: it only means that the absolute_scores are significantly different from 0.
- The coefficient for instruments_id == 4 is mildly significant. This is more interesting. With R's default treatment coding, each dummy coefficient compares one level against the reference level absorbed into the intercept (here, instruments_id == 1), so this would suggest that the absolute_scores awarded when instruments_id == 4 differ from those awarded at the reference level. The estimated coefficient is -16.667, which means that absolute_scores awarded when instruments_id == 4 are, on average, about 16.7 points lower than those awarded for the reference instrument.
- The adjusted R-squared of this regression model, 0.00455, is very low, which means the model explains almost none of the variance in the dataset. But this is expected: absolute_scores are assigned by human judges, and we cannot expect such a simplistic model to predict human behavior. It is still useful, however, to know which predictors are significant.
In my implementation, I used the `RInside` library, which allows me to call R's `lm` function through C++ wrapper classes provided by the library. Dr Shawn reminded me to be careful when writing my code to make sure that it is not vulnerable to R injection. To achieve this, I used the containers provided by the library, such as `Rcpp::NumericMatrix`, to pass numeric data from C++ to R instead of interpolating the data directly into a `std::string` of R code. The results of each regression are stored in the database compactly as a JSON string, since splitting the p-values into separate fields would not be particularly useful.
After finishing the regression procedure, I spent some time modifying (again..) my admin-import-tsv function in CAS, because my previous implementation could not support incremental import. Normally, if I only exported all tables in the CMS database and imported those tables into CAS, incremental import would not be difficult: all I would need to do is identify a natural primary key for each table (nric for players, guid for performances, instrument_name for instruments…) and then check whether a record already exists in the current CAS database during the import procedure. The algorithm was slightly more complicated because I was also exporting an additional (computed) table, absolutescores, from CMS, which consists of foreign keys to other tables (in CMS). So, in my new implementation, I first had to map out the relationships between the CMS tables (in TSV) and the absolutescores table.