Log-Scaling Correction, AFRL Data

Lately, I have been traveling to AFRL once or twice a week to work on data acquisition. We worked on making the routine data collection procedure reproducible, collecting and analyzing some data, syncing the clocks of the computers, expanding the partition on the BeagleBones, and modifying how we save all the data. Additionally, I have done some work on the synthetic data paper, implementing a log-scaling bias correction factor and running grid searches and model trainings to confirm that the corrected models perform better. In the coming weeks, I will continue working on data acquisition as well as the synthetic data paper.

Log-Scaling Correction

Log-Scaling Correction Derivation

As I’ve mentioned previously, log-scaling skews the predictions of an ML model; in fact, it always makes the model under-predict the true function. I am using the log-normal distribution (see the Wikipedia page) to add noise to the output energies. The relevant formula is

\begin{equation} \mu = \log\left(\frac{\mu_X^2}{\sqrt{\mu_X^2 + \sigma_X^2}}\right) \end{equation}

Here, $\mu_X$ would be the value of the output energy. After log-scaling, the model will ideally learn the output energy to be $\mu$. When we exponentiate the model prediction, we are left with $\exp(\mu) < \mu_X$. This is the origin of the under-prediction.

The PDF of the log-normal distribution is

\begin{equation} f(x) = \frac{1}{x \sigma \sqrt{2 \pi}} \exp\left(- \frac{(\log(x) - \mu)^2}{2 \sigma^2} \right) \end{equation}

where $\mu = \log\left(\frac{\mu_X}{\sqrt{1 + \alpha^2}} \right)$ is the mean of the logged distribution, $\sigma^2 = \log(1 + \alpha^2)$ is the variance of the logged distribution, and $\alpha = \sigma_X / \mu_X$ is the noise level of the original (log-normal) distribution. Here, $\mu_X$ and $\sigma_X$ are the mean and standard deviation of the log-normal distribution.

The expected value $E[x] = \mu_X$ can be confirmed by evaluating $\int_0^\infty x f(x)\, dx$ (I just used Mathematica). To get the expected value of the log of this distribution, we can take $\int_0^\infty \log(x) f(x)\, dx = \log\left(\frac{\mu_X}{\sqrt{1 + \alpha^2}} \right) = \mu$, which makes sense given our definition of $\mu$. We can calculate the under-prediction bias as a percentage:

\begin{equation} \text{bias} = \frac{\mu_X - \mu^*}{\mu_X} \times 100 \% = \frac{\sqrt{1 + \alpha^2} - 1}{\sqrt{1 + \alpha^2}} \times 100 \% \label{eq:bias} \end{equation}

where $\mu^* \equiv e^\mu = \frac{\mu_X}{\sqrt{1 + \alpha^2}}$.
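As a quick numerical check (my own sketch, not part of the paper's code), we can verify $E[x] = \mu_X$, $E[\log x] = \mu$, and the bias formula with a Monte Carlo sample in NumPy; the mean and noise level below are example values:

```python
import numpy as np

rng = np.random.default_rng(0)
mu_X, alpha = 10.0, 0.3                      # example mean and noise level
sigma = np.sqrt(np.log(1 + alpha**2))        # std. dev. of the logged distribution
mu = np.log(mu_X / np.sqrt(1 + alpha**2))    # mean of the logged distribution

x = rng.lognormal(mean=mu, sigma=sigma, size=1_000_000)
print(x.mean(), mu_X)                        # E[x] should match mu_X
print(np.log(x).mean(), mu)                  # E[log x] should match mu

analytic = (np.sqrt(1 + alpha**2) - 1) / np.sqrt(1 + alpha**2) * 100
empirical = (mu_X - np.exp(np.log(x).mean())) / mu_X * 100
print(analytic, empirical)                   # both ~4.2% for alpha = 0.3
```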

This bias is discussed online: a simple explanation is given in this Medium article, and an academic treatment is given by Miller (1984).

Techniques to Implement Prediction Correction

To fix this under-prediction bias, we want to multiply the predictions by a correction factor equal to $\mu_X / \mu^* = \sqrt{1 + \alpha^2}$. However, this comes with two assumptions:

  1. The noise in our experiment is distributed log-normally.
  2. We know the constant noise level $\alpha$ exactly.

This multiplicative factor of $\sqrt{1 + \alpha^2}$ should work for our synthetic data, but it may be difficult to apply to real experimental data, where the noise level and noise model may not be known exactly or may be hard to determine. The important point is that we are developing a correction factor equal to the ratio of two quantities: (1) the mean of the un-logged distribution, and (2) the exponential of the mean of the logged distribution. We can determine (1) from the experimental dataset and (2) from the training predictions on the experimental dataset. I explored three ways to come up with an effective correction factor $C$ (sketched in code after the list below). I will refer to the predictions as $y_P$ and the data as $y_D$.

  1. C = mean($y_D$) / mean($y_P$)
  2. C = mean($y_D / y_P$)
  3. Perform a linear least-squares fit $y_D = C y_P$ (setting the y-intercept to 0) to determine $C$
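Here is a minimal sketch of the three estimators (the function and array names are mine, not from the actual analysis code):

```python
import numpy as np

def correction_factors(y_D, y_P):
    """Three estimates of the correction factor C from training-set
    targets y_D and (exponentiated) model predictions y_P."""
    c1 = y_D.mean() / y_P.mean()          # method 1: ratio of means
    c2 = (y_D / y_P).mean()               # method 2: mean of ratios
    # Method 3: least-squares fit y_D = C * y_P with zero intercept,
    # which has the closed form C = sum(y_P * y_D) / sum(y_P**2).
    c3 = np.dot(y_P, y_D) / np.dot(y_P, y_P)
    return c1, c2, c3
```

Whichever estimate is used, the corrected test-set predictions are then $C\, y_P$.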

All three of these methods performed very similarly:

[Figure: MAPE on the testing set vs. noise level for the three correction methods, compared with the analytic log-scaling bias.]

The black dashed line shows the log-scaling bias from eq. \ref{eq:bias}. The gray dashed line shows the MAPE between the polynomial predictions and the exact energy (which is clearly affected by the log-scaling bias).

In red, green, and blue, we have the MAPE between the corrected polynomial predictions and the exact energy from methods 1, 2, and 3, respectively. I plotted them with different line widths so that they are all visible, since they overlap almost entirely. All three perform essentially the same and achieve much lower MAPEs than the log-scaling bias curve. Note that I computed the correction factor on the training set only and used it to correct the testing-set predictions in this plot.

To assess which method, if any, is better, I looked at all 420 comparisons with the testing set (7 noise levels from 0 to 30; 3 energy outputs: max, total, and average; and 20 different training-set sizes from 1000 to 20000) and counted which of the three methods had the lowest MAPE (a code sketch of this counting follows the conclusion below). Method 1 won 133 times, Method 2 won 206 times, and Method 3 won 81 times. Converting these to percentages:

  1. $31.7 \%$
  2. $49.0 \%$
  3. $19.3 \%$

So, from this analysis, method 2 appears to be a sensible choice to stick with. It also makes the most intuitive sense to me, since it averages the bias ratio data point by data point.
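For reference, the win-counting reduces to an argmin over the three MAPEs per configuration. A sketch, with the per-configuration arrays assumed to be collected elsewhere:

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error."""
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

def count_wins(cases):
    """cases: iterable of (y_test, y_pred, (c1, c2, c3)) tuples, one per
    (noise level, energy output, training-set size) combination."""
    wins = np.zeros(3, dtype=int)
    for y_test, y_pred, corrections in cases:
        errors = [mape(y_test, c * y_pred) for c in corrections]
        wins[np.argmin(errors)] += 1
    return wins, 100 * wins / wins.sum()   # counts and percentages
```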

WPAFB Beginning

I should look into implementing some kind of feedback loop so we can get going with the machine learning. I would need to do the following things:

  1. Implement some sort of looping program that loops through relevant parameters of interest. For now, the target data seems to be harder to manipulate, and Kyle says that the PZT position is something that should get written to file but isn’t written to the h5 files yet.
  2. Given a looping program, train a model on the data to find some optimal condition, such as the electron spectrum. We can condense the spectrum down to a pixel number weighted by counts (an average energy of sorts). The easiest thing to vary might be the pre-pulse delay, although it is currently stored only as a text file rather than in the h5 files. So perhaps the pre-pulse intensity should be varied instead, framing the search for the optimal pre-pulse as an optimization problem.
  3. Figure out how to take the results of the trained model and feed them back into EPICS (see the sketch after this list).
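As a starting point for the looping and feedback items, here is a minimal PyEpics sketch of one scan (the PV names are placeholders I made up; the real ones would come from the WPAFB EPICS configuration):

```python
import time
from epics import caget, caput   # pyepics

# Hypothetical PV names -- substitute the real records from the EPICS setup.
PREPULSE_PV = "LASER:PREPULSE:INTENSITY"
ESPEC_PV = "ESPEC:WEIGHTED_PIXEL"

def scan_prepulse(setpoints, settle_s=1.0):
    """Loop over pre-pulse intensity setpoints and record the
    counts-weighted pixel number after each change."""
    results = []
    for v in setpoints:
        caput(PREPULSE_PV, v, wait=True)   # write the new setpoint
        time.sleep(settle_s)               # let the system settle
        results.append((v, caget(ESPEC_PV)))
    return results
```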

Things to Do

  • Learn how PyEpics works
  • Learn how Nathaniel designed his looping LabVIEW program and make a program of my own
  • Learn how to append h5 files into another h5 file instead of a pandas DataFrame (see the sketch after this list). Should only extract the weighted pixel number from the Mightex data; otherwise, the data file will be too big.
  • Need to change the Espec and Pspec VMs’ time to be synced with XL-to-Go
  • Should store all files in the same directory (with the date string as a name). Can differentiate the files by the file prefix (e.g. DIODE or MIGHTEX).
  • Look at the MATLAB code for converting pixel number to energy
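For the h5-appending item, one possible approach (a sketch with hypothetical dataset names, assuming one Mightex counts array per shot file) is a resizable dataset in a master h5 file via h5py:

```python
import h5py
import numpy as np

def weighted_pixel(counts):
    """Counts-weighted average pixel number (an average energy of sorts)."""
    return np.average(np.arange(len(counts)), weights=counts)

def append_shot(master_path, shot_path, dset="MIGHTEX/counts"):
    """Extract only the weighted pixel number from one shot file and
    append it to a growing dataset in the master h5 file."""
    with h5py.File(shot_path, "r") as f:
        wp = weighted_pixel(f[dset][...])
    with h5py.File(master_path, "a") as m:
        if "weighted_pixel" not in m:
            m.create_dataset("weighted_pixel", shape=(0,),
                             maxshape=(None,), dtype="f8")
        d = m["weighted_pixel"]
        d.resize(d.shape[0] + 1, axis=0)
        d[-1] = wp
```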
Written on March 1, 2024