Last week, I finished implementing the functions for calculating the precision and recall values of judge scores. The final set of calculations that I need to implement is the **regression of judge scores on categorical factors**.

- First, I grouped the judge scores according to **venue**, **year**, and **judge_id**. After grouping, each group would look something like this:

  | venue_id | year | judge_id | category_id | instrument_id | performance_id | absolute_score |
  |---|---|---|---|---|---|---|
  | Penang | 2018 | Judge-2 | Under-12 | Violin | 1 | 80 |
  | Penang | 2018 | Judge-2 | Under-12 | Violin | 2 | 77 |
  | Penang | 2018 | Judge-2 | Young-Artiste | Cello | 3 | 83 |
  | ... | | | | | | |

- For each group, I perform two sets of regressions:
  - regress absolute_score **using category_id as the categorical predictor**
  - regress absolute_score **using instrument_id as the categorical predictor**

- The result of each of these regressions can be summarized in a table such as this (from R):

      Call:
      lm(formula = absolute_score ~ instruments_id, data = data.raw.judge2)

      Residuals:
           Min       1Q   Median       3Q      Max
      -22.4426  -6.4426   0.5574   5.5574  15.5574

      Coefficients:
                      Estimate Std. Error t value Pr(>|t|)
      (Intercept)       77.667      4.659  16.670   <2e-16 ***
      instruments_id2   -5.224      4.772  -1.095   0.2779
      instruments_id3   -3.667      9.318  -0.393   0.6953
      instruments_id4  -16.667      9.318  -1.789   0.0786 .
      ---
      Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

      Residual standard error: 8.07 on 62 degrees of freedom
      Multiple R-squared:  0.05049,  Adjusted R-squared:  0.00455
      F-statistic: 1.099 on 3 and 62 DF,  p-value: 0.3564

  - The intercept is strongly significant, but this is not useful information because it only means that the **mean absolute_score of the baseline level is significantly different from 0**. The baseline is the instrument whose coefficient is absent from the table, presumably instruments_id == 1.
  - The coefficient for instruments_id = 4 is mildly significant (p = 0.0786, so only at the 0.1 level). This is more interesting. R's default treatment coding measures each coefficient against the baseline level, so this suggests that the **absolute_scores awarded when instruments_id == 4** are *different from* the **absolute_scores awarded for the baseline instrument**. The estimated coefficient is -16.667, which means that the **absolute_scores awarded when instruments_id == 4** are on average about 16.7 points *lower* than those awarded for the baseline instrument.
  - The adjusted R-squared of this regression model, 0.00455, is very low, which means that the model explains almost none of the variance in the dataset. But this is expected, because we know that absolute_scores are assigned by human judges. We cannot expect such a simplistic model to be able to predict human behavior. However, it is still useful to know which predictors are **significant**.
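To make the meaning of the coefficients concrete: with a single categorical predictor and treatment coding, the least-squares intercept equals the mean score of the baseline level, and each remaining coefficient equals that level's mean minus the baseline mean. The sketch below computes exactly that in plain C++, without R. The names `Observation`, `CategoricalFit` and `fit_one_factor` are hypothetical, and picking the first level in sorted order as the baseline is an assumption that only mirrors R's default behaviour; it is not the CAS implementation and it does not compute standard errors or p-values.

```cpp
#include <cassert>
#include <iterator>
#include <map>
#include <string>
#include <utility>
#include <vector>

// One observation: the level of the categorical predictor and the score.
struct Observation {
    std::string level;  // e.g. an instrument or category name
    double score;       // absolute_score
};

// Treatment-coded fit: intercept plus one offset per non-baseline level.
struct CategoricalFit {
    std::string reference;                  // baseline level
    double intercept = 0.0;                 // mean score of the baseline level
    std::map<std::string, double> offsets;  // level mean minus baseline mean
};

// Fits score ~ level by ordinary least squares. With one categorical
// predictor the solution reduces to per-level means, so no matrix
// algebra is needed.
CategoricalFit fit_one_factor(const std::vector<Observation>& data) {
    assert(!data.empty());

    // Accumulate (sum, count) for each level.
    std::map<std::string, std::pair<double, int>> acc;
    for (const auto& o : data) {
        acc[o.level].first += o.score;
        acc[o.level].second += 1;
    }

    CategoricalFit fit;
    // Assumption: the first level in sorted order is the baseline.
    auto ref = acc.begin();
    fit.reference = ref->first;
    fit.intercept = ref->second.first / ref->second.second;

    for (auto it = std::next(acc.begin()); it != acc.end(); ++it) {
        double mean = it->second.first / it->second.second;
        fit.offsets[it->first] = mean - fit.intercept;
    }
    return fit;
}
```

Applied to the three sample rows shown earlier, with instrument as the predictor, the baseline would be Cello (intercept 83) and the Violin offset would be 78.5 - 83 = -4.5.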


In my implementation, I used the `RInside` library, which allows me to call the `lm` function in R through C++ wrapper classes provided by the library. Dr Shawn reminded me to be careful when writing my code to make sure that it is not vulnerable to R code injection. To achieve this, I used containers provided by the library, such as `Rcpp::NumericVector` and `Rcpp::NumericMatrix`, to pass numeric data from C++ to R instead of passing the data directly as a `std::string`. The results of a regression are stored in the database compactly as a JSON string, since it would not be very useful to split the p-values into separate fields.
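As an illustration of the storage step, the sketch below packs a coefficient table into a single JSON string using only the C++ standard library. The field names (`adj_r_squared`, `term`, `estimate` and so on) and the helper names are hypothetical choices for this example, not the layout actually used in CAS, and a real implementation would more likely rely on a JSON library.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <locale>
#include <sstream>
#include <string>
#include <vector>

// One row of the coefficient table reported by the regression.
struct CoefficientRow {
    std::string term;  // e.g. "(Intercept)" or "instruments_id4"
    double estimate;
    double std_error;
    double t_value;
    double p_value;
};

// Escapes quotes and backslashes. Assumption: term names contain no
// control characters, so those are not handled here.
std::string json_escape(const std::string& s) {
    std::string out;
    for (char c : s) {
        if (c == '"' || c == '\\') out += '\\';
        out += c;
    }
    return out;
}

// JSON has no representation for NaN or infinity, so reject them early.
void write_number(std::ostream& out, double v) {
    assert(std::isfinite(v));
    out << v;
}

// Serializes the regression summary as one compact JSON object.
std::string regression_to_json(double adj_r_squared,
                               const std::vector<CoefficientRow>& rows) {
    std::ostringstream out;
    out.imbue(std::locale::classic());  // always use '.' as decimal point
    out << "{\"adj_r_squared\":";
    write_number(out, adj_r_squared);
    out << ",\"coefficients\":[";
    for (std::size_t i = 0; i < rows.size(); ++i) {
        const CoefficientRow& r = rows[i];
        if (i > 0) out << ',';
        out << "{\"term\":\"" << json_escape(r.term) << "\",\"estimate\":";
        write_number(out, r.estimate);
        out << ",\"std_error\":";
        write_number(out, r.std_error);
        out << ",\"t_value\":";
        write_number(out, r.t_value);
        out << ",\"p_value\":";
        write_number(out, r.p_value);
        out << '}';
    }
    out << "]}";
    return out.str();
}
```

Keeping the whole table in one string means a single column holds the result of one regression, at the cost of not being able to filter on individual p-values in SQL.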

After I finished implementing the regression procedure, I spent some time modifying (again..) my **admin-import-tsv** function in CAS. My previous implementation could not support incremental import. If I were only exporting the tables in the CMS database and importing them into CAS, incremental import would not be difficult. All I would need to do is identify a natural primary key for each table (**nric** for players, **guid** for performances, **instrument_name** for instruments… ), and then I would be able to check whether a record already exists in the current CAS database during the import procedure. The algorithm was slightly more complicated because I was also exporting an additional (computed) table, **absolutescores**, from CMS, which consists of foreign keys to other tables (in CMS). So, in my new implementation, I had to first map out the relationships between the CMS tables (in tsv) and the **absolutescores** table.
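The idea can be sketched as follows: each table is matched on its natural key, so re-importing a known record reuses its existing id, and the import returns a map from CMS ids to CAS ids that can then be used to rewrite the foreign keys in **absolutescores**. This is a minimal in-memory sketch; `ExportedRow`, `Table`, `import_rows` and the `source_id` field are hypothetical names, and the real function works against the CAS database rather than a `std::map`.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// A row from the CMS export: its CMS-local id plus its natural key.
struct ExportedRow {
    int source_id;            // id used inside the CMS export (hypothetical)
    std::string natural_key;  // e.g. nric for players, guid for performances
};

// In-memory stand-in for one CAS table, indexed by natural key.
class Table {
public:
    // Returns the CAS id for the key, inserting a record only if absent.
    int upsert(const std::string& natural_key) {
        assert(!natural_key.empty());  // records without a key cannot be matched
        auto it = ids_.find(natural_key);
        if (it != ids_.end()) return it->second;  // already imported
        int id = next_id_++;
        ids_.emplace(natural_key, id);
        return id;
    }

    std::size_t size() const { return ids_.size(); }

private:
    std::map<std::string, int> ids_;
    int next_id_ = 1;
};

// Imports rows incrementally and returns a CMS-id to CAS-id map, which a
// dependent table such as absolutescores can use to rewrite its foreign keys.
std::map<int, int> import_rows(Table& table,
                               const std::vector<ExportedRow>& rows) {
    std::map<int, int> id_map;
    for (const auto& row : rows) {
        id_map[row.source_id] = table.upsert(row.natural_key);
    }
    return id_map;
}
```

With this shape, the referenced tables have to be imported before **absolutescores**, because each foreign key is looked up in the id map produced by the table it points to.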

In the upcoming week, I will need to add some code to CRS to export data from the **registrants** table (since the current export does not provide that). Dr Shawn also told me that I need to inspect the section of CRS that handles input data from users, because we need to make sure that the format of certain fields (such as **nric**) is standardized. After that, I will focus on writing the JavaScript for generating visualizations of the results of the calculations done in CAS.