Last week, I was working on the database section of CAS. It is certainly not perfect yet in terms of class design, but it is sufficient for me to move on to my next tasks. I spent about one and a half days implementing the classes (CASTSVImporter, AdminImportTSVResource) that will handle the import and string parsing of TSV files exported from the CMS database.

I then spent some time writing down, in detail, all of the analyses and calculations that I intend to perform. These statistics/plots all try to derive insights from the absolute scores given by the judges:

  1. Standardization of absolute scores (Preprocessing step; results are used internally, not presented directly to the public)
    • The i-th standard score assigned by the j-th judge can be obtained from standard_score[i, j] = (absolute_score[i, j] - mean(absolute_score[: , j])) / sqrt(variance(absolute_score[: , j]))
  2. General plots to convey the big picture (Targeted at a general audience, useful for anyone)
    • Data points in these plots will not be labeled, because the purpose is to illustrate the distribution of scores with respect to different factors, not to display the standard scores of each performer.
    • Boxplots:
      • Standard scores (continuous) against Judges (categorical)
      • Averaged standard scores (continuous) against Instruments (categorical)
      • Averaged standard scores (continuous) against Competition Categories (categorical)
    • Histograms of:
      • Standard scores given by a certain judge
      • Averaged standard scores given to performers who used a certain instrument
      • Averaged standard scores given to gold, silver, and bronze prize winners
      • Averaged standard scores at a specific venue/event
    • Each of these plots can be based on data from a specific year or on all historical data
  3. Distribution of prizes awarded to participants (Most useful for event organizers)
    • Motivation: To provide a prediction of the proportion of gold/silver/bronze winners of a specific event.
    • One way to do this is to assume that the proportion of prize winners at a certain event follows a discrete probability distribution, such as the multinomial distribution, which would have 4 parameters (p_gold, p_silver, p_bronze, p_participation). The parameters can be estimated from historical data using MLE; for the multinomial distribution, the MLE of each parameter is simply the corresponding sample proportion. We can then calculate the 95% confidence intervals of these parameters, which can be interpreted as “Assuming that the distribution is correct, then, in the most extreme cases, how high/low can the proportion of gold/silver/bronze prize winners go?” (a rough sketch of this calculation is shown after this list).
    • This approach assumes that the distribution of prize winners at an event stays constant over time, which may not be true. A time series approach might be able to account for changes in the proportion of gold/silver/bronze winners over time, but this approach is only sensible if there is a sufficiently long history of data (> 10 years).
    • For now, I think it would be sufficient to simply provide a plot of the proportion of gold/silver/bronze prize winners over time.
  4. Percentile calculation (Most useful for participants and registrants)
    • For each performance, calculate the percentile of the performer’s averaged standard score
      • Among all performances within the same year
      • Among all performances within the same year and category
      • Among all performances within the same year, category and event
      • Among all performances from all years
      • Among all performances from all years, but same category
      • Among all performances from all years, but same category and event
  5. Bias of judge scores towards other factors (Most useful for organizers)
    • I have already planned to use linear regression and logistic regression for this purpose, but I am still considering what kind of visualization would be best for presenting such results. A scatterplot might be good for visualizing the significance of an interaction in a regression model.
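
As a rough sketch of the confidence interval idea in item 3, the snippet below estimates each prize proportion as a sample proportion and attaches a simple normal-approximation (Wald) 95% interval to each one separately. The counts are made up purely for illustration, and a more careful multinomial treatment (e.g. simultaneous intervals) might be preferable later; this only shows the shape of the calculation.

```cpp
#include <cmath>
#include <cstdio>
#include <string>
#include <vector>

// One prize tier and its historical count (illustrative numbers only).
struct PrizeCount {
    std::string tier;
    int count;
};

int main() {
    const std::vector<PrizeCount> counts = {
        {"gold", 18}, {"silver", 35}, {"bronze", 52}, {"participation", 95}};

    int n = 0;
    for (const auto& c : counts) n += c.count;

    const double z = 1.96;  // ~97.5th percentile of the standard normal
    for (const auto& c : counts) {
        const double p  = static_cast<double>(c.count) / n;  // MLE: sample proportion
        const double se = std::sqrt(p * (1.0 - p) / n);      // Wald standard error
        std::printf("%-14s p = %.3f, 95%% CI = [%.3f, %.3f]\n",
                    c.tier.c_str(), p, p - z * se, p + z * se);
    }
    return 0;
}
```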

Dr Shawn helped me install two libraries, armadillo and RInside. After some experimenting, I found armadillo to be very useful for general matrix computations. Its syntax rules are actually quite similar to those of numpy in Python. RInside is basically a library that allows the programmer to use R “inside” C++ code by providing a C++ wrapper class around an R interpreter. If linear regression and logistic regression turn out to be too difficult for me to implement (and no other libraries are available), I may have to resort to using RInside to perform the model fitting and model inference procedures.
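
If I do end up going down the RInside route, the usage pattern (as far as I can tell from the examples that ship with the library) would look roughly like the sketch below: push C++ vectors into the embedded R session, let R's lm() do the fitting, and pull the coefficients back out. The data here are placeholders, and I have not verified this against the exact versions we installed.

```cpp
#include <RInside.h>  // embedded R interpreter (pulls in Rcpp)
#include <iostream>
#include <vector>

int main(int argc, char* argv[]) {
    RInside R(argc, argv);  // start the embedded R session

    // Placeholder data: e.g. standard scores (y) against some numeric factor (x).
    std::vector<double> x = {1, 2, 3, 4, 5};
    std::vector<double> y = {0.9, 2.1, 2.9, 4.2, 5.1};
    R["x"] = x;  // copy the C++ vectors into the R session
    R["y"] = y;

    // Fit a simple linear regression in R and pull the coefficients back out.
    Rcpp::NumericVector coefs = R.parseEval("coef(lm(y ~ x))");
    std::cout << "intercept = " << coefs[0] << ", slope = " << coefs[1] << std::endl;
    return 0;
}
```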

After familiarizing myself with armadillo, I started implementing the functions to calculate and store standard scores in CAS. Since armadillo already provides functions to efficiently calculate the mean and variance, I was able to implement this with relatively few lines of code. Next, I started writing the code to calculate percentiles, which turned out to be a little more complicated than I first imagined. I relied on the GROUP BY and ORDER BY statements in SQL to obtain the relevant set of standard scores in sorted order. Nevertheless, my first version of the function was still very long and repetitive. I am currently on my second attempt, and I am trying to make use of STL containers like std::unordered_map to hopefully make the calculation a little more efficient (in terms of complexity).
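
To make these two calculations concrete, here is a minimal sketch (not the actual CAS code) of how the standardization from item 1 can be expressed with armadillo, and how grouped percentiles in the spirit of item 4 can be computed with a std::unordered_map once the averaged standard scores have been retrieved from the database. The group keys, the sample values, and the “fraction of scores ≤ x” percentile definition are my own illustrative choices.

```cpp
#include <algorithm>
#include <armadillo>
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

// Item 1: standardize one judge's column of absolute scores.
// arma::stddev uses the sample (N-1) variance by default, which is assumed here.
arma::vec standardize(const arma::vec& absolute_scores) {
    const double mu = arma::mean(absolute_scores);
    const double sigma = arma::stddev(absolute_scores);
    return (absolute_scores - mu) / sigma;
}

// Item 4: percentile of `score` within one group of averaged standard scores,
// defined here as the percentage of scores in the group that are <= `score`.
double percentile(std::vector<double> group, double score) {
    std::sort(group.begin(), group.end());
    const auto it = std::upper_bound(group.begin(), group.end(), score);
    return 100.0 * static_cast<double>(it - group.begin()) / group.size();
}

int main() {
    const arma::vec judge_scores = {72, 80, 85, 90, 78};  // made-up absolute scores
    const arma::vec standard_scores = standardize(judge_scores);
    standard_scores.print("standard scores:");

    // Group averaged standard scores by a key such as "year|category";
    // both the keys and the values below are made up for illustration.
    std::unordered_map<std::string, std::vector<double>> groups;
    groups["2016|piano"]  = {-1.2, -0.3, 0.1, 0.8, 1.5};
    groups["2016|violin"] = {-0.9, 0.0, 0.4, 1.1};

    std::cout << "percentile = " << percentile(groups["2016|piano"], 0.8) << std::endl;
    return 0;
}
```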

As I implement these calculations, I am also constantly updating the CASDatabase class by adding new query functions to it. I am starting to realize what is useful and what can be discarded from my initial class designs. From now on, the Wt::Dbo sections of CAS will need to be updated and improved regularly, because there are certainly some issues that I did not account for when I first implemented them.

This week, Dr Shawn reminded me of the importance of adhering to the Git Workflow at all times, even if I am the only one working on the project. Recently, I had gotten a bit lazy about following it. Dr Shawn told me that this is bad software engineering practice and will not serve me well when I work in the software industry in the future. So, I need to be more conscious about this from now on and make sure that I don't repeat my mistakes.

