Fuchs ML Comparison Continued
Number of Points to Train
Before candidacy, I was exploring different machine learning models with the Fuchs model. I found that we only needed a very small number of points to train the model, but this was in part due to an error on my part: I was not resetting the PyTorch model parameters of the neural network between each training split. To fix this, I call
torch.save(model.state_dict(), 'temp.pt')
to save the initial untrained model state, and then at the beginning of each training iteration I call
model.load_state_dict(torch.load('temp.pt'))
to load this untrained model state. After doing this, I found that the neural network converges to a steady percentage error after a few thousand training points.
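As a minimal sketch, the reset pattern looks like the following (the network architecture and split loop here are placeholders, not the actual training script):

import torch
import torch.nn as nn

# Placeholder network; the real surrogate model has its own architecture.
model = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 1))

# Snapshot the untrained parameters once, before any training happens.
torch.save(model.state_dict(), 'temp.pt')

for split in range(5):  # loop over training splits
    # Restore the untrained state so every split starts from the same initialization.
    model.load_state_dict(torch.load('temp.pt'))
    # ... train on this split as usual ...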
Where MSE and MAPE Converge
In the process of creating these noisy datasets, we take the exact Fuchs model prediction for the max/total/average proton energies and add Gaussian noise of a given percentage. For example, 10% noise on some value $x$ adds an error sampled from a normal distribution with mean $\mu = 0$ and standard deviation $\sigma = 0.10\,x$. Since this noise is random, it cannot be predicted, and it places a lower bound on the mean squared error (MSE) and mean absolute percentage error (MAPE) we can obtain on noisy data.
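A short sketch of how that noise is added (the function and array names are just for illustration):

import numpy as np

rng = np.random.default_rng(0)

def add_percent_noise(values, noise_frac):
    # Zero-mean Gaussian noise whose standard deviation is noise_frac * |value|,
    # e.g. noise_frac = 0.10 for 10% noise.
    return values + rng.normal(0.0, noise_frac * np.abs(values))

# Example: 10% noise on a few placeholder Fuchs-model energies.
energies = np.array([2.5e-3, 8.0e-4, 4.5e4])
noisy_energies = add_percent_noise(energies, 0.10)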
MAPE
The Gaussian noise we placed on the data is based on a percentage. If we used a root-mean-squared percentage error metric, an optimal model would simply have a percentage error equal to the noise level. The mean absolute percentage error, however, carries a numerical factor of $\sqrt{2/\pi}$, which comes from taking the expectation value of the absolute error: for a zero-mean Gaussian with standard deviation $\sigma = \epsilon x$, the expected absolute deviation is $\epsilon x \sqrt{2/\pi}$, so
\begin{equation} MAPE = 100 \epsilon \sqrt{\frac{2}{\pi}} \end{equation}
For example, if we used 10% Gaussian noise ($\epsilon = 0.10$), then we would predict a MAPE of 7.979%.
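A quick Monte Carlo sanity check of that $\sqrt{2/\pi}$ factor (my own check, not part of the analysis code):

import numpy as np

rng = np.random.default_rng(1)
eps = 0.10                                   # 10% Gaussian noise
x = rng.uniform(1.0, 10.0, 1_000_000)        # arbitrary positive "true" values
noisy = x + rng.normal(0.0, eps * x)         # percentage-based noise

# Expected absolute percentage error of a perfect (noiseless) prediction
mape = 100.0 * np.mean(np.abs(noisy - x) / x)
print(mape, 100.0 * eps * np.sqrt(2.0 / np.pi))   # both come out near 7.98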
MSE
Unlike our noise, MSE is not based on a percentage, so we need to factor in a sense of scale for the MSE estimate. Taking a noiseless dataset, we compute the root mean square of the max/average/total energies to use as our scale factor. This gives $E_\text{scale} = \{2.494315 \times 10^{-3},\ 7.925738 \times 10^{-4},\ 4.492056 \times 10^{4}\}$ for the max, average, and total energy respectively (using a dataset of 20,000 points generated from the Fuchs model; these values would change if I changed the range of inputs we scan over). Multiplying this scale by the noise level (10% noise is $\epsilon = 0.10$) and squaring gives the MSE estimate:
\begin{equation} MSE = (\epsilon E_\text{scale} )^2 \end{equation}
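A sketch of this estimate, assuming the noiseless energies are available as a NumPy array (the names here are illustrative):

import numpy as np

def mse_floor(noiseless_energies, noise_frac):
    # E_scale is the root mean square of the noiseless energies; the expected
    # MSE of a perfect model on the noisy targets is (noise_frac * E_scale)**2.
    e_scale = np.sqrt(np.mean(noiseless_energies ** 2))
    return (noise_frac * e_scale) ** 2

# Example with a placeholder column of maximum proton energies and 10% noise.
max_energy = np.random.default_rng(2).uniform(1e-3, 5e-3, 20_000)
print(mse_floor(max_energy, 0.10))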
Model Comparisons
I ran simulations comparing the neural network (NN), Gaussian process (GP), and support vector regression (SVR) on a synthetic Fuchs dataset with 20,000 points, first with 10% Gaussian noise and then with 20% Gaussian noise. I am using the Owens P100 GPU, which has 16 GB of memory and is the default node on OSC. The training times for the different models are compared below,
and the MSE and percent error comparisons are shown in the following figures.
We can see that the GP model trains much faster than the NN and SVR while having a similar (but slightly higher) error. This is because I capped the GP regression at a maximum of 20 iterations, which stops its training prematurely; that explains both the higher error and the lower training time. This suggests that the SVR and NN could be sped up by relaxing their error tolerance, and that the GP could be tuned so that it does not need to go through its maximum number of iterations.
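The timing comparison itself is just wall-clock timing around each fit; a hedged sketch using scikit-learn's GaussianProcessRegressor and SVR as stand-ins for the actual implementations:

import time
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.svm import SVR

rng = np.random.default_rng(3)
X = rng.uniform(size=(2_000, 3))   # placeholder inputs (laser/target parameters)
y = rng.uniform(size=2_000)        # placeholder noisy energies

for name, model in [('GP', GaussianProcessRegressor()), ('SVR', SVR())]:
    start = time.perf_counter()
    model.fit(X, y)
    print(f'{name}: {time.perf_counter() - start:.2f} s to train')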
Chris suggested that a good target might be training on 100,000 data points in less than 100 seconds. He also said that, for figuring out where the MSE converges, I should compare directly to the noiseless Fuchs dataset. To do this, I could add columns to the Fuchs dataset that report the noiseless energies in addition to the energies with noise.
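Adding noiseless-energy columns could look something like the following pandas sketch (the column names are assumptions, not the dataset's actual names):

import numpy as np
import pandas as pd

rng = np.random.default_rng(4)

# Assumed column name; the real Fuchs dataset likely uses different ones.
df = pd.DataFrame({'max_energy_noiseless': rng.uniform(1e-3, 5e-3, 1_000)})

# Generate the noisy training targets from the noiseless column and keep both,
# so MSE/MAPE can be computed directly against the noiseless values later.
eps = 0.10
df['max_energy'] = df['max_energy_noiseless'] * (1.0 + eps * rng.standard_normal(len(df)))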
Things to Do
- Try to optimize hyper-parameters of GP Model to minimize time for a given number of points
- Scale up this model to 100,000 data points to see if it can still be run feasibly on OSC
- Modify the Fuchs dataset to add the noiseless energy columns for easier MSE comparison
- Think about two types of comparisons given a 100,000-point training dataset: (1) try to get under x amount of error no matter how much time training takes; (2) try to minimize error constrained to running the model for less than 100 seconds.