Radiation Protection Dosimetry Advance Access originally published online on January 12, 2007
Radiation Protection Dosimetry 2007 124(2):155-163; doi:10.1093/rpd/ncl544
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Statistical modelling of Poisson/log–normal data
Los Alamos National Laboratory MS-G761 Los Alamos, NM 87545, USA
* Corresponding author: guthrie{at}lanl.gov
Received June 23, 2006, amended November 27, 2006, accepted December 3, 2006
| ABSTRACT |
|---|
|
|
|---|
In statistical data fitting, self consistency is checked by examining the closeness of the quantity
2/NDF to 1, where
2 is the sum of squares of data minus fit divided by standard deviation, and NDF is the number of data minus the number of fit parameters. In order to calculate
2 one needs an expression for the standard deviation. In this note several alternative expressions for the standard deviation of data distributed according to a Poisson/log–normal distribution are proposed and evaluated by Monte Carlo simulation. Two preferred alternatives are identified. The use of replicate data to obtain uncertainty is problematic for a small number of replicates. A method to correct this problem is proposed. The log–normal approximation is good for sufficiently positive data. A modification of the log–normal approximation is proposed, which allows it to be used to test the hypothesis that the true value is zero.
| INTRODUCTION |
|---|
|
|
|---|
For radioactivity measurements obtained using counting methods, the number of measured counts N is usually distributed according to the Poisson distribution
|
| (1) |
is the true sample count rate and T is the sample count time (leaving aside the issue of background for the moment). An important source of uncertainty is the relationship between
and the quantity of interest
(e.g. 24 h urine excretion rate, or total lung activity),
|
| (2) |
This Poisson/log–normal model, including background, has been discussed in detail(3), and the exact likelihood function derived. Accepting this statistical model, it is sometimes useful to have a simpler characterisation of uncertainty than given by the exact likelihood function, e.g. to calculate
2 for data fitting. For the purposes of data fitting as envisioned in this note,
2 is defined as
|
| (3) |
i is a fit value and
i is the standard deviation of yi. There are i = 1 ... N measurements, which are assumed to be independent. The fitting function depends on a number M of parameters that are varied to minimise
2, and the expectation value of the minimum of
2 so obtained, barring singularities, is well known to be NDF = N – M, the number of degrees of freedom. This applies to nonlinear problems where
i is a nonlinear function of its parameters as long as certain requirements are met, principally that the JTJ (J=Jacobean) matrix is not singular at the minimum. In practice, the final minimum value of
2/NDF is inspected for closeness to 1 in order to test the hypothesis that the data may indeed be represented by the fit in a statistically self-consistent manner. For example, if there are a large number of similar fits, the average value of
2/NDF must equal 1 to be within statistical uncertainty. A special case is where the fitting function is 0, and the data are considered for consistency with 0. In this note expressions are derived for
i in the Poisson/log–normal model. In a data fitting situation such as this, the fit value, which is an estimator of the true mean value of yi, is also available for use in the expression for
i. The aim of the work is to obtain an expression for the standard deviation and thence
2 that has the proper expectation value (NDF). Alternate proposed expressions based on the theoretically derived standard deviation as well as replicate data, but using different choices of the available measurement and fit quantities, are tested using Monte Carlo.
In practice the uncertainty parameters associated with data are not often well known, and statistical data fitting usually involes reasonable adjustment of uncertainty parmeters to achieve the desired average value of
2/NDF = 1. In this way one arrives at a self-consistent set of uncertainty parameters and a modelling approach consistent with the data. This self-consistency check is quite important since subsequent uncertainty estimates depend on the assumed uncertainty parameters of the data.
Theoretical background
In the Poisson/log–normal model, the probability of measuring N gross counts in time T, given the true bioassay amount
and the true background count rate
b, is given by
|
| (4) |
|
| (5) |
For example, for a radiochemical measurement of urine excretion rate, the nominal value of the normalisation factor is given by
|
| (5A) |
count is the counting efficiency and
chem is the chemical yield. The uncertainty S results mostly from uncertainty in the excretion time, which depends on the sample collection protocol used and the techniques used to calculate the excretion time. This uncertainty is usually not included in the uncertainty reported by the analysis laboratory, even though it sometimes dominates. In Table 1 recommended values of S for various types of bioassay measurements are shown(4).
|
In this uncertainty model, there are only two sources of uncertainty, counting statistics and biological/sample collection variability. In fact, other sources of uncertainty are often insignificant, for example, uncertainty of tracer spike amount or tracer peak counting uncertainty in a radiochemical measurement using a tracer. In this simplified model, when the number of counts are small, counting statistics dominates, and conversely.
Note that the S values for some types of measurements might be large, for example, consider the situation with S = 3 for a room air monitor (corresponding to x or ÷ a factor of 20). This variability refers in this case to variability of measured air concentration from a room air monitor relative to true inhaled intake amount for a person in the room. When the measured counts are large, the biological/sample collection variability entirely dominates the uncertainty. Data fitting without including biological/sample collection variability (using just the measurement uncertainty reported by the analysis laboratory) would produce very large values of
2 and erroneously imply that the data are not consistent with the fit. A commonly used data fitting approach to this situation is to assume log–normal statistics (ignoring counting uncertainty) when the data are positive (above some decision level). But datasets are rarely all positive. What about the data that is not positive? Is it correct to ignore it? The methods derived here allow use of all the data in a unified manner.
Monte Carlo study
In the Moss study(1), many replicate samples were taken from individuals excreting constant, easily measured amounts of plutonium. The Monte Carlo study does the same thing numerically for different levels of urine excretion. Data are generated numerically in the following manner. Background counts Nb are generated from a Poisson distribution with mean µb=
bTb, where Tb is the background count time. The normalisation factor f is generated from the log–normal distribution given by Equation 5 to simulate biological/sample collection variability. The true mean value of the gross counts is calculated as
|
| (6) |
is the assumed true value of the quantity of interest (corresponding to the long-term average of urine excretion for an individual in the Moss study) and R = Tb/T is the ratio of background count time to sample count time. Gross counts N are then generated from a Poisson distribution with mean µg.
For each trial, the measurement result y is calculated as
|
| (7) |
y(means) normally reported by the analysis laboratory is obtained from N and Nb by assuming Poisson statistics,
|
| (8) |
y including both measurement and normalisation uncertainty?
To answer the question, first a theoretical calculation of the mean and standard deviation of gross counts from Equation 4 is carried out. By interchanging the summation with the integration and using the well-known expressions for moments of the Poisson distribution, the first and second moments of the gross counts are given by (E(.) denotes the expectation)
|
| (9) |
|
| (10) |
|
| (11) |
|
| (12) |
Combining expressions leads to the following:
|
| (13) |
|
| (14) |
|
| (15) |
2 defined as follows
|
| (16) |
2/Ntrials must approach 1 for a large number of trials. In data fitting, the true count rates µ and µb are not known so that Equation 13 cannot be directly applied. However, in addition to the data value, the fit value is available for each data point. The fit value corresponds to the true value in the above discussion; it is the value assumed for the true value. The uncertainty standard deviation can be constructed making use of the fit value in addition to measured quantities.
In order to test practical formulas that could be used in fitting data, the quantity
y in the denominator of Equation 16 is calculated in five different ways, the first being the verification and validation case:
- Using theoretically derived uncertainties from Equation 13.
- Using the estimate based on some number Nrep of replicate data. The measurement value y is then the mean result from Nrep samples and the uncertainty standard deviation is given by
In both alternatives (4) and (5), the true value of net counts is used rather than the measured value. In practice in a fitting procedure, µ =
/ fo would be obtained from the fit value.
The problem of zero counts giving zero calculated standard deviation (N = Nb = 0 causes alternative 3 to be 0) is resolved in two different ways in alternatives 4 and 5, using noninteger minimum count quantities Nmin and Nminb. Since the formula for standard deviation is based on estimating the true count rates µ and µb in terms of integer measured counts, it is equally valid to modify the formula by changing the measured counts by some fractional amount up to 1 count (e.g. replacing N = 0 gross counts by 1). The exact values of these parameters, the quantities Nmin and Nminb, might be obtained by an optimisation process that produces
2/Ntrials = 1 for all cases in the numerical study described below. Unfortunately, and not surprisingly, the optimum values of these parameters depend on the true values µ and µb, as well as the ratio of background count time to count time R. The exact values of these parameters are only important for small values of the counts, and the only essential is for them to be positive in order to eliminate the zero standard deviation catastrophe when the measured counts are zero.
Alternatives (4) and (5) are proposed here as the best. Alternate (4) can be expressed as
|
| (17) |
|
| (18) |
Numerical results
In the numerical tests,
2/Ntrial must equal 1 to be within the Monte Carlo statistical error. This requirement serves as a verification and validation check for both the theory and the numerical computations. The value of
2/Ntrials obtained for the other alternatives is a measure of validity and usefulness of each of the approximations.
Numerical generation of data and calculation of
2/Ntrials was done for the 5 uncertainty estimation techniques described above. The true average background counts were assumed to be µb = 4, and the ratio of background count time to sample count time R = 6. Three different levels of true amount
, low, medium and high, were used, with
/
y(meas) = 0.3, 1 and 3. Three replicate measurements were assumed (Nrep = 3). A large number of trials Ntrials =30,000 was generated in order to have small statistical uncertainty. The minimum count quantities where assumed to be Nmin = 0.5 and Nminb = 1. Results are shown in Table 2 for S = 0.5.
|
From Table 2 it is seen that the true uncertainties give
2/Ntrials = 1 as must be the case. The same is true for replicate uncertainties if Nrep is large. The proposed Poisson alternatives (4) and (5) are optimum.
Replicate data
By replicate data it is meant that a number of repeated measurements are used to calculate an average measurement value and the standard deviation of that average. If the number of measurements Nrep is large, this is a valid method of determining data uncertainty. The problem comes when the number of measurements is not large. Table 2 shows that replicate data with Nrep small has
2/Ntrials in the range 7–12, leading one to conclude erroneously that the data are inconsistent with the functions from which they are generated. Since some important existing datasets involve replicate data (e.g. Mayak urine data(5)), a method is needed to correct this problem.
The basic method proposed here is to multiply the uncertainty obtained from the standard deviation of the replicate measurements (divided by
) by a factor F. This reduces
2/Ntrials by F2. Table 2 shows that choosing F about 3 reduces
2/Ntrials to
1 for Nrep = 3. However, as shown in Table 2
2/Ntrials depends somewhat on the data level
/
y(meas), and this approach only approximately gives
2/Ntrials =1. So, a refinement is necessary.
The refinement consists in making the replacements
|
| (19) |
y(min) and S(min). The upper Equation 19 impacts low level data for which
y(rep) is small. The lower Equation 19 particularly impacts high level data where
is large. The parameters
y(min) and S(min) are chosen with respect to a particular dataset by first examining histograms showing distributions of log(
y) and log(Sy), where
y are the standard deviations calculated using the replicate data and Sy are the standard deviation of the logs from the replicate data. Such histograms for the numerical study described above are shown in Figures 1 and 2 (in Figure 1,
y is shown before multiplying by the factor F). Then, determine cuts
y(min) and S(min) that demark a small fraction of the smallest values in each distribution. Thus, according to Equation 19, only this small fraction will have significantly different values of
y(rep). For this small fraction, the overall effect of Equation 19 is to selectively reduce
2/Ntrials for low-level and high-level data, respectively, and values of
y(min) and S(min) can be chosen to make
2/Ntrials =1 for all three data levels shown in Table 2.
|
|
The Mayak urine dataset(5) consists of 24 h urine samples collected on a few successive days long after the beginning of employment and often after employment has ended. To illustrate the problem with these data in a Monte Carlo study, appropriate parameter values need to be chosen. From Table 1 for true 24 h urine data, S = 0.1 would normally be appropriate. However, there is an additional normalisation uncertainty associated with the measurement process (see Appendix), which did not use a radiochemical tracer and relied on uncertain estimation of chemical processing yield. So, for the Monte Carlo study S is assumed to be 0.3. The three parameters F = 2.4,
y(min) = 0.02 mBq d–1 and S(min) = 0.01 (for the experimental values R = 1, µb = 100, f0 = 0.02 mBq per (d-count)) were determined by the requirement that
2/Ntrials = 1 for the replicate data, for the three data levels: low, medium and high. The parameters
y(min) and S(min) are shown in Figures 1 and 2 superimposed on histograms of
y and Sy to show that only a small fraction of cases are strongly affected by the replacements given by Equation 19. Such a proposed procedure would, in practice, involve varying the three parameters F,
y(min) and S(min) to some extent and observing
2/NDF for the data in question.
Log–normal approximation
The log–normal distribution might be considered as the basic distribution and
2 defined in terms of log-transformed quantities. In Equation 4, the Poisson distribution may be approximated as a normal (gaussian) distribution by use of the relation
|
| (20) |
|
| (21) |
|
| (22) |
|
| (23) |
Equation 21 then becomes the convolution of two log–normals, and a
|
| (24) |
|
| (25) |
In the limit that Sy << S, the log–normal approximation is accurate, and
2 is given by
|
| (26) |
|
| (27) |
is small. Results corresponding to Table 2 are shown in Table 3. The minimum count quantity Nmin = 0.5 is used.
|
The log–normal approximation is accurate when Sy/S (2 divided by the quantity in column 1 of Table 3) in Equation 25 is small. For definiteness, the log–normal approximation would be recommended when
|
| (28) |
Use of the log–normal approximation automatically implies that the true value is greater than zero, whether or not this is actually true. However, the value of the likelihood function at
= 0 can be evaluated directly, and is given by
|
| (29) |
is the ratio of the likelihood at 0 to the maximum likelihood, given by
|
| (30) |
The idea is to modify the log–normal formula by limiting
to be greater than a minimum value
min based on
,
|
| (31) |
|
| (32) |
is changed gradually as follows:
|
| (33) |
< y. With this modification, the log–normal approximation reproduces the correct value of the likelihood at
= 0. | DISCUSSION |
|---|
|
|
|---|
The difference between Poisson/LN alternatives when S is large and the net counts moderately large should be emphasised. The basic Poisson/LN formula (alternative 3) then gives a standard deviation CV(S) times larger than the measurement value. So, an interpretation of the data as arising from zero true value would never be ruled out. All data are consistent with zero! The power of the data to discern the null hypothesis is lost. In contrast, alternatives (4) and (5) result in a standard deviation that is a function of the true value
. If
2 is evaluated for
= 0 to test the hypothesis of zero true amount, the basic Poisson/LN calculation would give a value implying consistency with the hypothesis, while alternatives (4) and (5) might give a very large value, ruling out the hypothesis.
A numerical example may be helpful. Assume S = 3 (as for a room air monitor, giving A(S) = CV(S) = 90), N = 4 gross counts and a well-determined average background of 1 count (
T = Nb/R=1, R large). The normalisation factor is assumed to be 1. To test the hypothesis that the true amount is zero,
2 is evaluated for
= 0. The measurement would be reported as 3 ± 2. The log–normal inequality, Equation 28, is marginally satisfied (2/(3 x 3) < 1/3), so all the alternatives are considered including the log–normal. The standard deviation calculated using alternatives (3) through (5) is, respectively, 270 = 3 x 90, 2 and 1, leading to the following values of
2: 1/902 = 0.0001, (3/2)2 = 2.25, and (3/1)2 = 9. The modified log–normal approximation gives
|
| (34) |
2 much less than 1) while alternatives (4), (5) and the modified log–normal approximation imply, with differing degrees of certainty, that the data point is not consistent with zero, and therefore something measurable is present. The correct Bayesian interpretation of this data point requires calculation and examination of the posterior probability distribution, which is the likelihood function multiplied by the prior probability distribution (Ref 3). The exact likelihood function as well as the log–normal approximation are shown in Figure 3. With an alpha prior corresponding to rare true positives, these data would not be interpreted as something detected; however, with a broad log–normal prior, it would lead to that conclusion.
|
Alternative (5) is attractive theoretically. The measurements are then characterised by measurement value, S value (e.g. from Table 1), measurement standard deviation for zero true amount and median normalisation factor. Alternative (4) requires only the measurement value and its standard deviation as usually reported by an analysis laboratory and S value.
An important conclusion of this work is that reliance on replicate data with a small number of replicates to provide uncertainties (without recording count quantities) is not a good practice. The problem can be understood as being caused by the uncertainty of sample standard deviation as an estimator of the population standard deviation when the sample is small. When the sample standard deviation fluctuates towards zero, this has a pronounced effect in increasing
2 because the standard deviation enters as a square in the denominator in the formula for
2.
| APPENDIX—ADDITIONAL LOG–NORMAL BACKGROUND |
|---|
|
|
|---|
If an additional log-normally distributed background b is present and included in the analysis (e.g. for uranium environmental background in urine), it is important to distinguish measurement normalisation fm and its attendant variability and uncertainty from overall normalisation. Measurement normalisation uncertainty might be caused by uncertainty in radiochemical yield for a radiochemical measurement of urine not using a tracer. The overall normalisation factor is the product of fm with another factor fn (already described in the body of the paper) giving variability of the normalisation from quantity of interest (e.g. true urine excretion rate) to measurement quantity (activity in sample).
If a constant background b is measured, the true value of mean counts caused by this background will be b/fm. The actual background is assumed to be log-normally distributed, i.e. it is assumed to be given by b fb, where b is a constant and fb is log-normally distributed with median 1. The first and second moments of the gross counts are then given by (E(.) denotes the expectation)
|
| (361) |
After a straightforward but long calculation, the following result is obtained.
|
| (36) |
/f. | REFERENCES |
|---|
|
|
|---|
- Moss W.D., Campbell E. E., Schulte H. F., Tietjen G. L. A study of the variations found in plutonium urinary data. Health Phys (1969) 17:571–578.[Web of Science][Medline]
- Khokhryakov V.F, Suslova K. G., Vostrotin V. V., Romanov S. A., Eckermann K. F., Krahenbuhl M. P., Miller S. C. Adaptation of the ICRP publication 66 respiratory tracft model to data on plutonium biokinetics for Mayak workers. Health Phys (2005) 88:125–132.[CrossRef][Web of Science][Medline]
- Miller G., Martz H.F, Little T., Guilmette R. Using exact poisson likelihood functions in Bayesian interpretation of counting measurements. Health Phys (2002) 83:512–518.[CrossRef][Web of Science][Medline]
- Miller G. Likelihood functions and uncertainties. In: Los Alamos Radiological Dose Assessment Technical Issue Paper 005 (2005) Los Alamos, NM): (Los Alamos National Laboratory.
- Romanov S.A., Vasilenko E. K., Khokhryakov J. P. Studies on the Mayak nuclear workers: dosimetry. Radiat. Environ. Biophys (2002) 41(1):23–28.
- Miller G., Inkret W. C. Excretion-time normalization of urine samples using specific gravity. In: Los Alamos Radiological Dose Assessment Technical Issue Paper 011 (1999) Los Alamos, NM: (Los Alamos National Laboratory.
This article has been cited by other articles:
![]() |
G. Miller, V. Vostrotin, and V. Vvedensky UNCERTAINTIES OF MAYAK URINE DATA Radiat Prot Dosimetry, March 26, 2009; (2009) ncp024v1. [Abstract] [Full Text] [PDF] |
||||
![]() |
J. W. Marsh, C. M. Castellani, C. Hurtgen, M. A. Lopez, A. Andrasi, M. R. Bailey, A. Birchall, E. Blanchardon, A. D. Desai, M.-D. Dorrian, et al. INTERNAL DOSE ASSESSMENTS: UNCERTAINTY STUDIES AND UPDATE OF IDEAS GUIDELINES AND DATABASES WITHIN CONRAD PROJECT Radiat Prot Dosimetry, October 6, 2008; (2008) ncn218v2. [Abstract] [Full Text] [PDF] |
||||
![]() |
G. Miller, L. Bertelli, and R. Guilmette IMPDOS (improved dosimetry and risk assessment for plutonium-induced diseases): internal dosimetry software tools developed for the Mayak worker study Radiat Prot Dosimetry, September 1, 2008; 131(3): 308 - 315. [Abstract] [Full Text] [PDF] |
||||
![]() |
G. Miller, R. Guilmette, L. Bertelli, T. Waters, S. A. Romanov, and Y. V. Zaytseva Uncertainties in internal doses calculated for mayak workers--a study of 63 cases Radiat Prot Dosimetry, September 1, 2008; 131(3): 316 - 330. [Abstract] [Full Text] [PDF] |
||||
![]() |
G. Miller, D. Melo, H. Martz, and L. Bertelli An empirical multivariate log-normal distribution representing uncertainty of biokinetic parameters for 137Cs Radiat Prot Dosimetry, August 1, 2008; 131(2): 198 - 211. [Abstract] [Full Text] [PDF] |
||||
![]() |
G. Miller, L. Bertelli, R. Guilmette, M. W. McNaughton, and W. F. Eisele A study of early Los Alamos internal exposures to plutonium Radiat Prot Dosimetry, July 1, 2008; 130(4): 503 - 509. [Abstract] [Full Text] [PDF] |
||||
![]() |
P. V.S.V. Neti and R. W. Howell Lognormal Distribution of Cellular Uptake of Radioactivity: Statistical Analysis of {alpha}-Particle Track Autoradiography J. Nucl. Med., June 1, 2008; 49(6): 1009 - 1016. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||




























