Fuchs v2 Results, 03/11 Data
Lately, I have been working on updating the Synthetic Data paper. I have finished most of what I wanted to change or improve, so now we will mostly just be making edits. Additionally, I have been going to the lab in Dayton, slowly making improvements to the Data Acquisition System and learning how to handle the data.
New Synthetic Data Results
I decided to redo a lot of the synthetic data results. Here are the main changes I implemented:
- Performing a grid search over hyperparameters with 5-fold CV to better justify the model architecture.
- Implementing a correction factor to account for the log-scaling bias.
- Using log-Gaussian noise instead of Gaussian noise (see the sketch after this list).
- Changing the neural network to a skorch implementation with early stopping and a validation set.
- Changing the GPR and SVR to chained-output regressors so that we don't have three completely independent models.
- Doing 40 data-fraction splits instead of 20 to look at multiples of 500 training points (instead of 1000).
- Adding optimization plots that compare the effectiveness of each model at predicting a certain desired condition.
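As a concrete illustration of the noise change, below is a minimal sketch of how multiplicative log-Gaussian noise could be applied to the synthetic outputs. Treating the noise level as the standard deviation in log space, along with the function and variable names, is my own assumption for illustration, not the exact implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_log_gaussian_noise(y, noise_level=0.10, rng=rng):
    """Apply multiplicative (log-Gaussian) noise to positive-valued outputs.

    Here the noise level is treated as the standard deviation in log space,
    so a 10% setting perturbs log(y) by N(0, 0.10^2). This interpretation is
    an assumption made only for this sketch.
    """
    return np.exp(np.log(y) + rng.normal(0.0, noise_level, size=y.shape))

# Example: perturb a small batch of (max, total, average) energy outputs.
energies = np.array([[12.0, 450.0, 3.1],
                     [ 8.5, 300.0, 2.4]])
noisy_energies = add_log_gaussian_noise(energies, noise_level=0.10)
```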
Exploratory Data Analysis
I made a scatter plot of all 25,000 points in the noiseless Fuchs dataset to understand the dependence of the maximum energy on the three inputs (the total and average energy behave similarly):
Here, we can see a roughly linear dependence on intensity, a negative quadratic dependence on focal distance, and an inverse relationship with target thickness. We can also see that there are far more points with low energy than with high energy, which suggests the distribution of energies in the dataset is skewed, so I took the log of the output energies and plotted their distribution as a histogram:
Here, it is clear that the log of the energies is approximately normally distributed. This is the motivation behind the logarithmic scaling, and in our tests it helped the models train much better.
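The correction factor for the log-scaling bias mentioned in the change list is not spelled out here, so as one hedged possibility, the sketch below uses Duan's smearing estimator: train on log(y), then multiply the exponentiated predictions by the mean of the exponentiated log-space residuals. The actual correction in the paper may differ, and the LinearRegression stand-in, the data, and the names are only for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def fit_in_log_space(model, X_train, y_train):
    """Fit a regressor on log-transformed targets."""
    model.fit(X_train, np.log(y_train))
    return model

def predict_with_smearing(model, X_train, y_train, X_test):
    """Undo the log transform with a smearing-style bias correction.

    Naively taking exp() of a log-space prediction underestimates the mean of a
    log-normally distributed target; Duan's smearing factor (the mean of the
    exponentiated residuals) is one common correction.
    """
    residuals = np.log(y_train) - model.predict(X_train)  # log-space residuals
    smear = np.mean(np.exp(residuals))                    # smearing factor
    return np.exp(model.predict(X_test)) * smear

# Stand-in data just to make the sketch runnable.
rng = np.random.default_rng(1)
X = rng.uniform(1.0, 10.0, size=(200, 3))
y = np.exp(0.3 * X.sum(axis=1) + rng.normal(0.0, 0.1, size=200))
reg = fit_in_log_space(LinearRegression(), X, y)
y_pred = predict_with_smearing(reg, X, y, X)
```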
Grid Searches
I used GridSearchCV, which scans over all combinations of the specified hyperparameters, trains with 5-fold cross-validation, and picks the set of hyperparameters with the best CV score (i.e., the lowest mean squared error averaged over the 5 CV splits). I used a Fuchs dataset with 10 percent added noise and 5000 training points. This way, we are not picking a new set of hyperparameters for each noise level and number of training points when we get to the training step, so we can see how resilient these models are when adding/removing points or changing the noise.
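For reference, a call of roughly this shape is what I mean by the grid search. The parameter grid and the stand-in data below are placeholders, not the actual ranges or dataset that were scanned.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Stand-in data mimicking the shape of the problem:
# (intensity, focal distance, target thickness) -> log max energy.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(500, 3))
y = np.log(1.0 + 5.0 * X[:, 0] + rng.normal(0.0, 0.1, size=500))

# Hypothetical grid; the actual ranges scanned are not reproduced here.
param_grid = {
    "C": [0.5, 1.0, 2.5, 5.0],
    "epsilon": [1e-3, 1e-2, 1e-1],
}

search = GridSearchCV(
    estimator=SVR(kernel="rbf", gamma="scale"),
    param_grid=param_grid,
    cv=5,                               # 5-fold cross-validation
    scoring="neg_mean_squared_error",   # best = lowest mean MSE across the folds
    n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_, -search.best_score_)
```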
The final neural network architecture I settled on included the following parameters:
- batch_size = 256
- gamma = 0.95 (learning rate decay)
- n_hidden_layers = 3
- n_neurons = 64 (per hidden layer)
- learning_rate = 1e-3 (initial learning rate)

and I used the Adam optimizer, MSELoss loss function, and LeakyReLU activation function because these have been working well for us for a while now. I used the default patience of 5 for the early stopping function.
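A sketch of how this network could be assembled in skorch with the hyperparameters above is shown below. The module definition, the maximum epoch cap, and the choice of ExponentialLR to realize the gamma = 0.95 decay are my reconstruction, not the exact code used in the paper.

```python
import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import ExponentialLR
from skorch import NeuralNetRegressor
from skorch.callbacks import EarlyStopping, LRScheduler

class MLP(nn.Module):
    """3 hidden layers x 64 neurons with LeakyReLU activations (3 inputs -> 3 outputs)."""
    def __init__(self, n_in=3, n_out=3, n_hidden_layers=3, n_neurons=64):
        super().__init__()
        layers, width = [], n_in
        for _ in range(n_hidden_layers):
            layers += [nn.Linear(width, n_neurons), nn.LeakyReLU()]
            width = n_neurons
        layers.append(nn.Linear(width, n_out))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

net = NeuralNetRegressor(
    MLP,
    criterion=nn.MSELoss,
    optimizer=optim.Adam,
    lr=1e-3,          # initial learning rate
    batch_size=256,
    max_epochs=200,   # assumed cap; early stopping usually ends training sooner
    callbacks=[
        EarlyStopping(patience=5),                      # default patience, monitors valid_loss
        LRScheduler(policy=ExponentialLR, gamma=0.95),  # per-epoch learning-rate decay
    ],
    # skorch's default train_split holds out 20% of the data as the validation set.
)
# net.fit(X_train.astype("float32"), np.log(y_train).astype("float32"))
```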
Using all these settings, I also made a figure comparing the effect of different model architectures (varying the number of hidden layers and the number of neurons per layer):
Here, we can see that the (64x3) architecture not only has the best CV score, but also the smallest uncertainty in its score.
The final SVR architecture I settled on included the following parameters:
- C = 2.5 (regularization parameter)
- epsilon = 1e-2 (specifies the tube within which no penalty is associated in the loss function)
- tol = 1e-3 (tolerance for the stopping criterion)

and I used the defaults of gamma = scale and kernel = rbf.
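The chained-output setup mentioned in the change list could look something like the following with scikit-learn's RegressorChain, where each output's prediction is appended as an extra feature for the next output in the chain. The chain order shown is an assumption.

```python
from sklearn.multioutput import RegressorChain
from sklearn.svm import SVR

# Final SVR settings; RegressorChain clones this estimator once per output.
base_svr = SVR(kernel="rbf", gamma="scale", C=2.5, epsilon=1e-2, tol=1e-3)

# Assumed chain order: max energy -> total energy -> average energy.
chained_svr = RegressorChain(base_svr, order=[0, 1, 2])
# chained_svr.fit(X_train, np.log(y_train))    # y_train columns: [max, total, average]
# log_predictions = chained_svr.predict(X_test)
```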
The final GPR architecture I settled on included the following parameters:
- num_epochs = 30
- lr = 2e-1

The grid search initially yielded a learning rate of 5e-1, but I found that this learning rate was a bit too high during training for other noise levels/splits, causing some numerical instabilities. So I settled on the smaller learning rate, which also has the added advantage of a lower mean uncertainty estimate (calculated point by point) from the GPR fit of around $8.4 \%$ (compared to $11-12 \%$ for the other parameters).
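Since the GPR is trained with a learning rate and an epoch count, I am assuming a GPyTorch-style exact GP; the single-output sketch below shows the kind of training loop those two settings plug into. The kernel choice, the stand-in data, and the way the point-by-point relative uncertainty is computed are all assumptions for illustration.

```python
import torch
import gpytorch

class ExactGPModel(gpytorch.models.ExactGP):
    """Minimal single-output exact GP with an RBF kernel."""
    def __init__(self, train_x, train_y, likelihood):
        super().__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

    def forward(self, x):
        return gpytorch.distributions.MultivariateNormal(
            self.mean_module(x), self.covar_module(x)
        )

def train_gp(train_x, train_y, num_epochs=30, lr=2e-1):
    likelihood = gpytorch.likelihoods.GaussianLikelihood()
    model = ExactGPModel(train_x, train_y, likelihood)
    model.train()
    likelihood.train()
    # model.parameters() already includes the likelihood's noise term.
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)
    for _ in range(num_epochs):
        optimizer.zero_grad()
        loss = -mll(model(train_x), train_y)
        loss.backward()
        optimizer.step()
    return model, likelihood

# Stand-in data (3 inputs -> log max energy) just to make the sketch runnable.
x = torch.rand(200, 3)
y = torch.log(1.0 + 5.0 * x[:, 0])
gp, lik = train_gp(x, y)

# One (assumed) way to get a mean point-by-point relative uncertainty from the fit.
x_test = torch.rand(100, 3)
gp.eval()
lik.eval()
with torch.no_grad(), gpytorch.settings.fast_pred_var():
    post = lik(gp(x_test))
    mean_rel_unc = (post.stddev / post.mean.abs().clamp_min(1e-6)).mean()
```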
Below is the CV-score comparison of the three models, where we see the regression models fare better than the neural network model:
Model Training
From a timing perspective, the results did not differ much: the GPR still takes a very long time to train while the SVR trains extremely quickly, and the GPR still consumes a lot of GPU memory, which can be mitigated by using approximate methods like the SVGP. As far as model performance goes, the grid searches were a good indicator, because the models rank GPR > SVR > NN in MAPE on noiseless testing data:
In the following plots, the dashed lines are the noiseless-testing MAPE and the solid lines are the noisy-testing MAPE. We can compare the accuracy of the models with 500 training points and with 20,000 training points:
Additionally, I made a plot showing three levels of noise (0, 15, and 30 percent),
where one useful feature of the NN seems to be that it gets better or stays the same with more noise. With little noise, however, it performs much worse than the other two models.
Optimization
A big motivation for the synthetic data training is that it can illustrate how a learned model can smooth out the noise and still make good predictions. I designed a computational optimization experiment to compare the effectiveness of the models. I fixed the intensity at its maximum and examined which points in the target thickness - focal distance space have a maximum proton energy closest to a specified cutoff while also keeping the laser-to-proton conversion efficiency (equivalently, the total proton energy in our model) as high as possible (so long as the maximum proton energy isn't more than $15 \%$ away from the specified cutoff). The following plot illustrates the region in phase space where these conditions are satisfied
where the gray lines mark the $15 \%$ cutoff, the red lines indicate the exact energy cutoff, and the green region only maximizes the total energy. The blue region weights both maximizing the total energy and keeping the max energy near the cutoff, and it is the primary region of interest. The green and blue regions are within 5 percent of the optimal values, which are indicated with the green and blue stars.
If we did not train an ML model and just plotted 2,000 points worth of data (with 30 percent added noise), we would get a plot like this:
where the individual blue and green points fall within 5 percent of the blue and green stars. This plot is not very helpful: the points are scattered all over the place because of the high noise level and the limited amount of data.
Instead, we can fit an ML model to the 2,000 points of data, which smooths out the noise. I then generated 100,000 prediction points from the trained models and evaluated the desired objective functions on each to make the plots seen below:
Here, the GPR seems to perform the best and the NN the worst, as expected. Even so, all of the models still provide a much better estimate than the raw data.
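For clarity, the selection logic described above can be sketched as a simple mask-and-rank over a dense prediction grid. Everything below (the parameter ranges, the cutoff value, and the blue-region weighting) is a placeholder, and the model predictions are replaced with random numbers only so that the sketch runs.

```python
import numpy as np

# Hypothetical selection over a dense prediction grid at fixed (maximum) intensity.
rng = np.random.default_rng(0)
n_points = 100_000
thickness = rng.uniform(0.5, 5.0, n_points)          # placeholder ranges
focal_distance = rng.uniform(-50.0, 50.0, n_points)
max_energy = rng.uniform(1.0, 30.0, n_points)        # would come from a trained model
total_energy = rng.uniform(10.0, 1000.0, n_points)   # would come from a trained model

cutoff = 15.0        # desired maximum proton energy (placeholder value)
tolerance = 0.15     # stay within 15% of the cutoff

# Green-style objective: maximize total energy among points near the cutoff.
near_cutoff = np.abs(max_energy - cutoff) / cutoff <= tolerance
best_green = np.argmax(np.where(near_cutoff, total_energy, -np.inf))

# Blue-style objective: also reward staying close to the exact cutoff
# (the actual weighting used in the plots is not reproduced here).
score = total_energy / total_energy.max() - np.abs(max_energy - cutoff) / cutoff
best_blue = np.argmax(np.where(near_cutoff, score, -np.inf))

print(thickness[best_green], focal_distance[best_green])
print(thickness[best_blue], focal_distance[best_blue])
```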
To Do
- Learn about PyEpics and how to send commands with Python
- Look through the LabVIEW files and see if I can make any modifications to the looping programs
- Figure out how to deal with the times possibly being off
- Make a program to append the relevant data to an h5 file where they are all aligned by shot number or time
- Use a simplified version of the MATLAB code to get a rough estimate of the pixel number to energy conversion (B-field only) and see if it is a linear transformation. If so, maybe this conversion isn't really necessary.