Regression via RInside, incremental import

Last week, I finished implementing the functions for calculating precision and recall values of judge scores. The final set of calculations that I need to implement is the regression of judge scores on categorical factors.

First, I grouped the judge scores by venue, year, and judge_id. After grouping, each group looks something like this:

```
venue_id, year, judge_id, category_id, instrument_id, performance_id, absolute_score
Penang, 2018, Judge-2, Under-12, Violin, 1, 80
Penang, 2018, Judge-2, Under-12, Violin, 2, 77
Penang, 2018, Judge-2, Young-Artiste, Cello, 3, 83
...
```
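
The grouping step can be sketched in C++ roughly as follows. This is only an illustration, not the actual CAS code: the `ScoreRow` struct, the `GroupKey` alias, and the `group_scores` function are hypothetical names, and the field types are guessed from the sample rows above.

```cpp
#include <map>
#include <string>
#include <tuple>
#include <vector>

// One row of judge-score data. Field names follow the sample above;
// the types are assumptions made for this sketch.
struct ScoreRow {
    std::string venue_id;
    int year;
    std::string judge_id;
    std::string category_id;
    std::string instrument_id;
    int performance_id;
    double absolute_score;
};

// Group key: (venue_id, year, judge_id).
using GroupKey = std::tuple<std::string, int, std::string>;

// Partition the rows so that each (venue, year, judge) combination
// gets its own vector of rows, ready to be regressed on separately.
std::map<GroupKey, std::vector<ScoreRow>> group_scores(const std::vector<ScoreRow>& rows) {
    std::map<GroupKey, std::vector<ScoreRow>> groups;
    for (const ScoreRow& row : rows) {
        groups[GroupKey{row.venue_id, row.year, row.judge_id}].push_back(row);
    }
    return groups;
}
```

Using a `std::map` keyed on a tuple keeps the groups in a stable, sorted order, which makes it easier to compare the output of two runs.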

For each group, I perform two sets of regressions:
- regress absolute_score on category_id as a categorical predictor
- regress absolute_score on instrument_id as a categorical predictor

The result of each of these regressions can be summarized in a table such as this (from R):
```
Call:
lm(formula = absolute_score ~ instruments_id, data = data.raw.judge2)

Residuals:
     Min       1Q   Median       3Q      Max 
-22.4426  -6.4426   0.5574   5.5574  15.5574 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)       77.667      4.659  16.670   <2e-16 ***
instruments_id2   -5.224      4.772  -1.095   0.2779    
instruments_id3   -3.667      9.318  -0.393   0.6953    
instruments_id4  -16.667      9.318  -1.789   0.0786 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.07 on 62 degrees of freedom
Multiple R-squared:  0.05049,	Adjusted R-squared:  0.00455 
F-statistic: 1.099 on 3 and 62 DF,  p-value: 0.3564
```
- The intercept is strongly significant, but this is not useful information. With R's default treatment coding, the intercept is the mean absolute_score for the baseline level (here instruments_id == 1, the level without its own coefficient), so the test only says that this mean is significantly different from 0.
- The coefficient for instruments_id == 4 is mildly significant (p = 0.0786, so only at the 0.1 level). This is more interesting, because it suggests that the absolute_scores awarded when instruments_id == 4 differ from those awarded for the baseline instrument. The estimated coefficient is -16.667, which means that absolute_scores awarded when instruments_id == 4 are on average about 16.7 points lower than those awarded for the baseline instrument.
- The adjusted R-squared of this regression model, 0.00455, is very low, which means that the model explains almost none of the variance in the dataset (the overall F-test, with p = 0.3564, is not significant either). But this is expected, because we know that absolute_scores are assigned by human judges. We cannot expect such a simplistic model to predict human behavior. However, it is still useful to know which predictors are significant.
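
To make the meaning of the Estimate column concrete: with a single categorical predictor and treatment coding, the fitted intercept equals the mean score of the baseline level, and every other coefficient equals that level's mean minus the baseline mean. The sketch below computes the estimates by hand in C++. It is not the RInside call used in CAS; `DummyFit` and `fit_single_factor` are hypothetical names, and it reproduces only the estimates (standard errors and p-values still have to come from lm).

```cpp
#include <cassert>
#include <map>
#include <utility>
#include <vector>

// Estimates of score ~ factor under treatment (dummy) coding.
struct DummyFit {
    double intercept;                    // mean score of the baseline level
    std::map<int, double> coefficients;  // level -> (level mean - baseline mean)
};

// observations: (factor level, score) pairs for one judge group.
// baseline_level: the reference level, which gets no coefficient of its own.
DummyFit fit_single_factor(const std::vector<std::pair<int, double>>& observations,
                           int baseline_level) {
    // Accumulate the sum and the count of scores per level.
    std::map<int, std::pair<double, int>> totals;
    for (const auto& obs : observations) {
        totals[obs.first].first += obs.second;
        totals[obs.first].second += 1;
    }
    assert(totals.count(baseline_level) == 1);  // baseline needs at least one score

    DummyFit fit;
    fit.intercept = totals[baseline_level].first / totals[baseline_level].second;
    for (const auto& entry : totals) {
        if (entry.first == baseline_level) continue;  // baseline is the intercept
        const double level_mean = entry.second.first / entry.second.second;
        fit.coefficients[entry.first] = level_mean - fit.intercept;
    }
    return fit;
}
```

For example, with made-up scores of 80 and 76 for level 1 and 60 and 62 for level 4, and level 1 as the baseline, the intercept is 78 and the coefficient for level 4 is -17.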

In my implementation, I used the RInside library, which allows me to call R's lm function through C++ wrapper classes provided by the library. Dr Shawn reminded me to write my code carefully so that it is not vulnerable to R code injection. To achieve this, I pass numeric data from C++ to R through containers provided by the library, such as Rcpp::NumericVector and Rcpp::NumericMatrix, instead of embedding the values in a std::string of R code. The results of each regression are stored in the database compactly as a JSON string, since it would not be very useful to split the p-values into separate fields.

After I finished implementing the regression procedure, I spent some time modifying (again...) my admin-import-tsv function in CAS. My previous implementation did not support incremental import. Normally, if I only exported the tables in the CMS database and imported them into CAS, incremental import would not be difficult. All I would need to do is identify a natural primary key for each table (nric for players, guid for performances, instrument_name for instruments, …), and then I could check during the import whether a record already exists in the current CAS database. The algorithm was slightly more complicated because I was also exporting an additional (computed) table, absolutescores, from CMS, which consists of foreign keys to other tables (in CMS). So, in my new implementation, I first had to map out the relationships between the CMS tables (in TSV) and the absolutescores table.
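
The idea can be sketched in C++ as follows, using performances (natural key: guid) as the example table. Everything here is an assumption made for illustration: `CasTable` is an in-memory stand-in for a real database table, and the struct and function names are hypothetical. The point is only that records are matched on their natural key, and that the CMS ids found in absolutescores are translated to CAS ids through a map built during the import.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a CAS table whose rows have a natural key.
class CasTable {
public:
    // Insert the record only if its natural key is new; return its CAS id either way.
    int upsert(const std::string& natural_key) {
        auto it = id_by_key_.find(natural_key);
        if (it != id_by_key_.end()) return it->second;  // already imported earlier
        const int id = next_id_++;
        id_by_key_.emplace(natural_key, id);
        return id;
    }
    std::size_t size() const { return id_by_key_.size(); }

private:
    std::unordered_map<std::string, int> id_by_key_;
    int next_id_ = 1;
};

struct CmsPerformance { int cms_id; std::string guid; };
struct CmsAbsoluteScore { int cms_performance_id; double absolute_score; };
struct CasAbsoluteScore { int cas_performance_id; double absolute_score; };

// Import performances and remember which CAS id each CMS id ended up with.
std::unordered_map<int, int> import_performances(CasTable& table,
                                                 const std::vector<CmsPerformance>& rows) {
    std::unordered_map<int, int> cms_to_cas;
    for (const CmsPerformance& row : rows) {
        cms_to_cas[row.cms_id] = table.upsert(row.guid);
    }
    return cms_to_cas;
}

// Rewrite the foreign keys of the exported scores from CMS ids to CAS ids.
// std::unordered_map::at throws std::out_of_range for an unknown CMS id.
std::vector<CasAbsoluteScore> remap_scores(const std::vector<CmsAbsoluteScore>& scores,
                                           const std::unordered_map<int, int>& cms_to_cas) {
    std::vector<CasAbsoluteScore> remapped;
    remapped.reserve(scores.size());
    for (const CmsAbsoluteScore& score : scores) {
        remapped.push_back(CasAbsoluteScore{cms_to_cas.at(score.cms_performance_id),
                                            score.absolute_score});
    }
    return remapped;
}
```

Importing a second export that contains an already known guid leaves the table size unchanged for that record and maps its CMS id to the existing CAS id. The sketch leaves out deduplication of the score rows themselves.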

In the upcoming week, I will need to add some code to CRS to export data from the registrants table (since the current export does not provide that). Dr Shawn also told me that I need to inspect the section of CRS that handles input data from users, because we need to make sure that the format of certain fields (such as nric) is standardized. After that, I will focus on writing the JavaScript for generating visualizations of the results of the calculations done in CAS.

Published by Wen Yan on 2018-08-16
