Publications

M. Croci, G. N. Wells, Mixed-precision finite element kernels and assembly: Rounding error analysis and hardware acceleration, Preprint (2024). Preprint. Software.

In this paper we develop the first fine-grained rounding error analysis of finite element (FE) cell kernels and assembly. The theory includes mixed-precision implementations and accounts for hardware acceleration via matrix multiplication units, thus providing theoretical guidance for designing reduced- and mixed-precision FE algorithms on CPUs and GPUs. Guided by this analysis, we introduce hardware-accelerated mixed-precision implementation strategies that are provably robust to low-precision computations. Indeed, these algorithms are accurate to the lower-precision unit roundoff with an error constant that is independent of the conditioning of FE basis function evaluations, the ill-posedness of the cell, the polynomial degree, and the number of quadrature nodes. We also present the first AMX-accelerated FE kernel implementations on Intel Sapphire Rapids CPUs. Numerical experiments demonstrate that the proposed mixed- (single/half-) precision algorithms are up to 60 times faster than their double-precision equivalents while being orders of magnitude more accurate than their fully half-precision counterparts.
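
As a rough illustration of the precision split the paper targets, the numpy sketch below (hypothetical function and variable names, not the paper's code) assembles an element mass-matrix kernel from Float16-stored inputs while accumulating all products in Float32, emulating what matrix units such as Intel AMX or NVIDIA tensor cores do in hardware.

```python
import numpy as np

def mass_kernel_mixed(phi, wdetJ):
    """Element mass-matrix kernel A = phi^T diag(wdetJ) phi.

    Inputs are stored in Float16 (low precision) but all products are
    accumulated in Float32, mimicking matmul units that take fp16
    inputs and accumulate in fp32. A sketch: numpy only emulates here
    what AMX/tensor-core hardware does natively.
    """
    phi16 = phi.astype(np.float16)      # basis tabulation at quadrature points
    w16 = wdetJ.astype(np.float16)      # quadrature weights times |det J|
    phi32 = phi16.astype(np.float32)    # cast up: fp32 multiply-accumulate
    return phi32.T @ (w16.astype(np.float32)[:, None] * phi32)

# toy check against a double-precision reference
rng = np.random.default_rng(0)
nq, ndof = 16, 6                        # quadrature points, local dofs
phi = rng.random((nq, ndof))
w = rng.random(nq)
A_mixed = mass_kernel_mixed(phi, w)
A_ref = phi.T @ (w[:, None] * phi)
print(np.abs(A_mixed - A_ref).max())
```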

M. Croci, K. E. Willcox, S. J. Wright, Multi-output multilevel best linear unbiased estimators via semidefinite programming, Computer Methods in Applied Mechanics and Engineering (2023). Article. Software.

Multifidelity forward uncertainty quantification (UQ) problems often involve multiple quantities of interest and heterogeneous models (e.g., different grids, equations, dimensions, physics, surrogate and reduced-order models). While computational efficiency is key in this context, multi-output strategies in multilevel/multifidelity methods are either sub-optimal or non-existent. In this paper we extend multilevel best linear unbiased estimators (MLBLUE) to multi-output forward UQ problems and present new semidefinite programming formulations for their optimal setup. These formulations yield not only the optimal number of samples required, but also the optimal selection of low-fidelity models to use. While existing MLBLUE approaches are single-output only and require a non-trivial nonlinear optimization procedure, the new multi-output formulations can be solved reliably and efficiently. We demonstrate the efficacy of the new methods and formulations in practical UQ problems with model heterogeneity.
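
For readers curious what such a formulation looks like, here is a hedged cvxpy sketch of the single-output sample-allocation problem underlying MLBLUE, with random placeholder data; cvxpy's matrix_frac atom canonicalizes the objective into exactly the kind of Schur-complement semidefinite program the paper builds on. The multi-output formulations in the paper are richer than this toy.

```python
import cvxpy as cp
import numpy as np

# Placeholder data: K sample groups. psi[k] = R_k^T C_k^{-1} R_k is the
# (PSD) information matrix of group k and c[k] its cost per sample; here
# they are random stand-ins for quantities estimated from pilot runs.
n_models, K = 4, 6
rng = np.random.default_rng(1)
psi = []
for _ in range(K):
    X = rng.standard_normal((n_models, n_models))
    psi.append(X @ X.T + 1e-3 * np.eye(n_models))
c = rng.uniform(1.0, 10.0, size=K)
budget = 100.0

m = cp.Variable(K, nonneg=True)        # samples per group (continuous relaxation)
Psi = sum(m[k] * psi[k] for k in range(K))
e1 = np.zeros(n_models); e1[0] = 1.0   # selects the high-fidelity mean

# Estimator variance e1^T Psi(m)^{-1} e1: matrix_frac turns this convex
# objective into a Schur-complement SDP under the hood.
prob = cp.Problem(cp.Minimize(cp.matrix_frac(e1, Psi)),
                  [c @ m <= budget])
prob.solve()                           # needs an SDP-capable solver, e.g. SCS
print(m.value.round(2), prob.value)    # round m up to integers in practice
```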

M. Croci, G. Rosilho de Souza, Mixed-precision explicit stabilized Runge-Kutta methods for single- and multi-scale differential equations, Journal of Computational Physics (2022). Article. Slides.

Mixed-precision algorithms combine low- and high-precision computations in order to benefit from the performance gains of reduced-precision arithmetic without sacrificing accuracy. In this work, we design mixed-precision Runge-Kutta-Chebyshev (RKC) methods, where high precision is used for accuracy and low precision for stability. Generally speaking, RKC methods are low-order explicit schemes with a stability domain that grows quadratically with the number of function evaluations; for this reason, most of the computational effort is spent on stability rather than accuracy. In this paper, we show that a naïve mixed-precision implementation of any Runge-Kutta scheme can harm the convergence order of the method and limit its accuracy, and we introduce a new class of mixed-precision RKC schemes that is unaffected by this limiting behaviour. We present three mixed-precision schemes: a first- and a second-order RKC method, and a first-order multirate RKC scheme for multiscale problems. These schemes perform in high precision only the few function evaluations needed for accuracy (one and two for the first- and second-order methods, respectively), while the rest are performed in low precision. We prove that while these methods are essentially as cheap as their fully low-precision equivalents, they retain the convergence order of their high-precision counterparts. Indeed, numerical experiments confirm that these schemes are as accurate as the corresponding high-precision method.
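
To make the idea concrete, below is a plain-Python sketch of one first-order RKC step in the spirit of the abstract: a single high-precision evaluation f(y) anchors accuracy, while the internal stages use low-precision increments. This is an illustrative reading of the construction, not the authors' exact scheme; the coefficients follow the standard damped RKC recurrence.

```python
import numpy as np

def chebyshev_T_dT(s, w0):
    """T_j(w0) and T_j'(w0) for j = 0..s via the three-term recurrences."""
    T, dT = [1.0, w0], [0.0, 1.0]
    for j in range(2, s + 1):
        T.append(2 * w0 * T[-1] - T[-2])
        dT.append(2 * T[j - 1] + 2 * w0 * dT[-1] - dT[-2])
    return T, dT

def rkc1_mixed_step(f, y, dt, s, eps=0.05, low=np.float32):
    """One s-stage first-order RKC step with a single high-precision f eval.

    Stage right-hand sides are approximated by f(y) plus a low-precision
    difference, one plausible realization of the high/low split described
    in the abstract (a sketch, not the paper's exact scheme).
    """
    w0 = 1.0 + eps / s**2
    T, dT = chebyshev_T_dT(s, w0)
    w1 = T[s] / dT[s]
    b = [1.0 / Tj for Tj in T]
    f_hi = f(y)                               # the one high-precision eval
    f_lo_y = f(y.astype(low))
    def g(u):                                 # f(u) ~ f_hi + low-prec increment
        return f_hi + (f(u.astype(low)) - f_lo_y).astype(y.dtype)
    u_prev, u = y, y + (w1 / w0) * dt * f_hi  # first stage: exact f(y)
    for j in range(2, s + 1):
        mu = 2 * w0 * b[j] / b[j - 1]
        nu = -b[j] / b[j - 2]
        mut = 2 * w1 * b[j] / b[j - 1]
        u_prev, u = u, mu * u + nu * u_prev + mut * dt * g(u)
    return u
```

In practice the stage count s is chosen so that Δt times the spectral radius of the Jacobian fits within the method's stability interval, which grows roughly like 2s² for small damping; this is what makes the many cheap low-precision evaluations worthwhile.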

M. Klöwer, S. Hatfield, M. Croci, P. Düben, T. Palmer, Fluid simulations accelerated with 16 bits: Approaching 4x speedup on A64FX by squeezing ShallowWaters.jl into Float16, Journal of Advances in Modeling Earth Systems (2021). Article.

Most Earth-system simulations run on conventional CPUs in 64-bit double-precision floating-point numbers (Float64), although the need for high-precision calculations in the presence of large uncertainties has been questioned. Fugaku, currently the world’s fastest supercomputer, is based on A64FX microprocessors, which also support the 16-bit low-precision format Float16. We investigate the Float16 performance on A64FX with ShallowWaters.jl, the first fluid circulation model that runs entirely with 16-bit arithmetic. The model implements techniques that address precision and dynamic-range issues in 16 bits. The precision-critical time integration is augmented with compensated summation to minimize rounding errors; such a compensated time integration is as precise as, but faster than, mixed precision with 16- and 32-bit floats. As subnormals are inefficiently supported on A64FX, the very limited range available in Float16 is 6×10⁻⁵ to 65504. We develop the analysis number format Sherlogs.jl to log the arithmetic results during the simulation. The equations in ShallowWaters.jl are then systematically rescaled to fit into Float16, using 97% of the available representable numbers. Consequently, we benchmark speedups of 3.8x on A64FX with Float16; adding a compensated time integration, the speedup is 3.6x. Although ShallowWaters.jl is simplified compared to large Earth-system models, it shares essential algorithms and therefore shows that 16-bit calculations are indeed a competitive way to accelerate Earth-system simulations on available hardware.
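
The compensated time integration mentioned above is, at its core, Kahan summation applied to the state update. A minimal numpy sketch (an illustration of the technique, not ShallowWaters.jl code) of a Float16 Euler integrator with a compensated update:

```python
import numpy as np

def euler_float16_compensated(f, u0, dt, nsteps):
    """Explicit Euler entirely in Float16, with a Kahan-compensated update.

    The running compensation c recovers the low-order bits that a bare
    Float16 update u + dt*f(u) would lose to rounding.
    """
    u = u0.astype(np.float16)
    c = np.zeros_like(u)                        # running compensation
    dt = np.float16(dt)
    for _ in range(nsteps):
        du = np.float16(dt * f(u)) - c          # increment, corrected by c
        t = np.float16(u + du)                  # low bits of du may be lost...
        c = np.float16(np.float16(t - u) - du)  # ...and are recovered here
        u = t
    return u

# decay ODE u' = -u: compensated Float16 stays close to the exact solution
f = lambda u: -u
u16 = euler_float16_compensated(f, np.array([1.0]), 1e-3, 1000)
print(u16, np.exp(-1.0))
```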

M. Croci, M. Fasi, N. J. Higham, T. Mary, M. Mikaitis, Stochastic Rounding: Implementation, Error Analysis and Applications, Royal Society Open Science (2022). Article.

Stochastic rounding randomly maps a real number to one of the two nearest values in a finite-precision number system. First proposed for use in computer arithmetic in the 1950s, it is attracting renewed interest. If used in floating-point arithmetic in the computation of the inner product of two vectors of length n, it yields an error bounded by √n u with high probability, where u is the unit roundoff, which is not necessarily the case for round to nearest. A particular attraction of stochastic rounding is that, unlike round to nearest, it is immune to the phenomenon of stagnation, whereby a sequence of tiny updates to a relatively large quantity is lost. We survey stochastic rounding, covering its mathematical properties and probabilistic error analysis, its implementation, and its use in applications, including deep learning and the numerical solution of differential equations.
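
A reference implementation makes the definition concrete. The numpy sketch below rounds stochastically to Float16 and reproduces the stagnation phenomenon; fast implementations work at the bit level instead, and edge cases such as overflow are ignored here.

```python
import numpy as np

rng = np.random.default_rng(0)

def sr_to_float16(x):
    """Stochastically round float64 values to Float16.

    x is mapped to one of its two Float16 neighbours with probability
    proportional to the distance from the other one, so E[sr(x)] = x.
    """
    x = np.asarray(x, dtype=np.float64)
    down = x.astype(np.float16)                   # nearest neighbour...
    down = np.where(down.astype(np.float64) > x,  # ...pushed below x
                    np.nextafter(down, np.float16(-np.inf)), down)
    up = np.nextafter(down, np.float16(np.inf))
    p = (x - down.astype(np.float64)) / (up.astype(np.float64) - down.astype(np.float64))
    return np.where(rng.random(x.shape) < p, up, down)

# stagnation: repeatedly add 1e-4 (below half an ulp of 1.0) to 1.0
s_rn, s_sr = np.float16(1.0), np.float16(1.0)
for _ in range(10_000):
    s_rn = np.float16(s_rn + np.float16(1e-4))            # RtN: update lost
    s_sr = sr_to_float16(s_sr.astype(np.float64) + 1e-4)  # SR: unbiased
print(s_rn, s_sr)   # RtN stays at 1.0; SR ends near the true value 2.0
```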

M. Croci, M. B. Giles, Effects of round-to-nearest and stochastic rounding in the numerical solution of the heat equation in low precision, IMA Journal of Numerical Analysis (2022). Preprint. Slides.

Motivated by the advent of machine learning, the last few years have seen the return of hardware-supported low-precision computing. Computations with fewer digits are faster and more memory and energy efficient, but can be extremely susceptible to rounding errors. An application that can benefit greatly from low-precision computing is the numerical solution of partial differential equations (PDEs), but a careful implementation and rounding error analysis are required to ensure that sensible results can still be obtained. In this paper we study the accumulation of rounding errors in the solution of the heat equation, a proxy for parabolic PDEs, via Runge-Kutta finite difference methods using round-to-nearest (RtN) and stochastic rounding (SR). We demonstrate how to implement the scheme to reduce rounding errors and we derive a priori estimates for local and global rounding errors. Let u be the unit roundoff. While the worst-case local errors are O(u) with respect to the discretization parameters, the RtN and SR error behavior is substantially different. We prove that the RtN error behavior depends on the discretization, the initial condition and the precision, and that the solution always stagnates for small enough Δt. Until stagnation, the global error grows like O(u Δt^{-1}). In contrast, we show that the leading-order errors introduced by SR are zero-mean, independent in space and mean-independent in time, making SR resilient to stagnation and rounding error accumulation. In fact, we prove that for SR the global rounding errors are only O(u Δt^{-1/4}) in 1D and are essentially bounded (up to logarithmic factors) in higher dimensions.
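
The RtN stagnation effect is easy to reproduce. In the toy numpy experiment below (an illustration, not the paper's setup), an explicit finite difference solver for the 1D heat equation is run in Float16 with round-to-nearest, and the error against a Float64 reference grows as Δt shrinks, until the Float16 solution stops evolving altogether.

```python
import numpy as np

def heat_step(u, lam):
    """One explicit Euler step for u_t = u_xx with homogeneous Dirichlet BCs."""
    du = lam * (np.roll(u, 1) - 2 * u + np.roll(u, -1))
    du[0] = du[-1] = 0
    return u + du

N = 64
x = np.linspace(0.0, 1.0, N)
u0 = np.sin(np.pi * x)
for dt in [1e-4, 2e-5, 4e-6]:
    lam = dt * (N - 1) ** 2            # dt / dx^2, kept below 1/2 for stability
    nsteps = round(0.1 / dt)           # integrate to t = 0.1
    u64, u16 = u0.copy(), u0.astype(np.float16)
    for _ in range(nsteps):
        u64 = heat_step(u64, lam)
        u16 = heat_step(u16, np.float16(lam))
    print(dt, np.abs(u64 - u16).max())  # error grows as dt decreases
```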

M. Croci, M. B. Giles, M. E. Rognes, and P. E. Farrell, Multilevel quasi Monte Carlo methods for elliptic partial differential equations driven by spatial white noise, SIAM Journal on Scientific Computing. Preprint. Talk.

When solving partial differential equations driven by additive spatial white noise, the efficient sampling of white noise realizations can be challenging. In this paper we focus on the efficient sampling of white noise using quasi-random points in a finite element and multilevel quasi Monte Carlo (MLQMC) setting. This work extends previous research on white noise sampling for multilevel Monte Carlo (MLMC).

We express white noise as a wavelet series expansion that we divide into two parts. The first part is sampled using quasi-random points and contains a finite number of terms, ordered by decaying importance to ensure good QMC convergence. The second part is a correction term, which is sampled using standard pseudo-random numbers.

We show how both terms can be sampled with time and memory complexity that is linear in the number of mesh cells via a supermesh construction. Furthermore, our technique can be used to enforce the MLQMC coupling even in the case of non-nested mesh hierarchies. We demonstrate the efficacy of our method with numerical experiments.
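
As a generic illustration of this splitting (not the paper's wavelet- and supermesh-based implementation), the following scipy sketch draws the leading expansion coefficients from a scrambled Sobol sequence and the correction term from a pseudo-random generator:

```python
import numpy as np
from scipy.stats import norm, qmc

def hybrid_normal_coefficients(n_qmc, n_total, n_samples, seed=0):
    """Draw n_samples vectors of i.i.d. N(0,1) expansion coefficients.

    The n_qmc leading coefficients (ordered by decaying importance) come
    from a scrambled Sobol sequence mapped through the inverse normal CDF;
    the remaining n_total - n_qmc correction coefficients use ordinary
    pseudo-random numbers.
    """
    sobol = qmc.Sobol(d=n_qmc, scramble=True, seed=seed)
    z_qmc = norm.ppf(sobol.random(n_samples))   # low-discrepancy part
    rng = np.random.default_rng(seed)
    z_mc = rng.standard_normal((n_samples, n_total - n_qmc))
    return np.hstack([z_qmc, z_mc])

Z = hybrid_normal_coefficients(n_qmc=32, n_total=1024, n_samples=8)
print(Z.shape)   # (8, 1024)
```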

M. Croci, V. Vinje and M. E. Rognes, Fast uncertainty quantification of tracer distribution in the brain interstitial fluid with multilevel and quasi Monte Carlo, International Journal for Numerical Methods in Biomedical Engineering. Preprint.

Efficient uncertainty quantification algorithms are key to understanding the propagation of uncertainty – from uncertain input parameters to uncertain output quantities – in high-resolution mathematical models of brain physiology. Advanced Monte Carlo methods such as quasi Monte Carlo (QMC) and multilevel Monte Carlo (MLMC) have the potential to dramatically improve upon standard Monte Carlo (MC) methods, but their applicability and performance in biomedical applications are underexplored. In this paper, we design and apply QMC and MLMC methods to quantify uncertainty in a convection-diffusion model of tracer transport within the brain. We show that QMC outperforms standard MC simulations when the number of random inputs is small. MLMC considerably outperforms both QMC and standard MC methods and should therefore be preferred for brain transport models.
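
For context, MLMC rests on the telescoping identity E[P_L] = E[P_0] + Σ_{l=1}^{L} E[P_l − P_{l−1}], estimating each correction term with its own sample size. A generic sketch with a hypothetical sampler (not the paper's brain-transport model):

```python
import numpy as np

def mlmc_estimate(sampler, n_samples):
    """Multilevel Monte Carlo estimate of E[P_L].

    sampler(l, n) must return n coupled samples of P_l - P_{l-1} (of P_0
    when l = 0); n_samples[l] is the sample size on level l, typically
    large on the cheap coarse levels and small on the expensive fine ones.
    """
    return sum(np.mean(sampler(l, n)) for l, n in enumerate(n_samples))

# toy sampler: P_l is a "level-l discretization" of exp(theta),
# coupled across levels through the common random input theta
rng = np.random.default_rng(3)
def sampler(l, n):
    theta = rng.standard_normal(n)
    P = lambda lev: np.exp(theta * (1.0 - 2.0 ** -(lev + 1)))
    return P(l) - P(l - 1) if l > 0 else P(0)

print(mlmc_estimate(sampler, n_samples=[100_000, 10_000, 1_000]))
```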

M. Croci and P. E. Farrell, Complexity bounds on supermesh construction for quasi-uniform meshes, Journal of Computational Physics 414 (2020), pp. 1–7. Article. Preprint.

Projecting fields between different meshes commonly arises in computational physics. This operation may require a supermesh construction, in which case its computational cost is proportional to the number of supermesh cells n. Given any two quasi-uniform meshes of nA and nB cells, respectively, we show under standard assumptions that n is proportional to nA + nB. This result substantially improves on the best currently available upper bound on n and is fundamental for the analysis of algorithms that use supermeshes.
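
In one dimension the bound is easy to see: the supermesh vertices are just the union of the two vertex sets, so the supermesh has at most nA + nB − 1 cells. A toy sketch of this special case (the paper's result concerns quasi-uniform meshes in arbitrary dimension, where the construction is far less trivial):

```python
import numpy as np

def supermesh_1d(vA, vB):
    """Vertices of the supermesh of two 1D meshes of the same interval."""
    return np.union1d(vA, vB)          # union of the two sorted vertex sets

vA = np.linspace(0.0, 1.0, 5)          # nA = 4 cells
vB = np.array([0.0, 0.3, 0.55, 1.0])   # nB = 3 cells
n = len(supermesh_1d(vA, vB)) - 1      # supermesh cell count
print(n)                               # 6 <= nA + nB - 1 = 6
```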

M. Croci, V. Vinje, M. E. Rognes, Uncertainty quantification of parenchymal tracer distribution using random diffusion and convective velocity fields, Fluids and Barriers of the CNS 16(1) (2019), pp. 1–32. Article.

Background: Influx and clearance of substances in the brain parenchyma occur by a combination of diffusion and convection, but the relative importance of these mechanisms is unclear. Accurate modeling of tracer distributions in the brain relies on parameters that are partially unknown and whose literature values vary by up to seven orders of magnitude. In this work, we rigorously quantified the variability of tracer enhancement in the brain resulting from uncertainty in diffusion and convection model parameters.
Methods: Using the convection-diffusion-reaction equation on a mesh of a human brain, we simulated tracer enhancement in the brain parenchyma after intrathecal injection. Several models were tested to assess the uncertainty in both the type of diffusion and velocity fields, as well as the importance of their magnitudes. Our results were compared with experimental MRI results of tracer enhancement.
Results: In models of pure diffusion, the expected amount of tracer in the gray matter reached its peak value after 15 hours, while the white matter did not reach its peak within 24 hours with high likelihood. Models of the glymphatic system behaved qualitatively similarly to the models of pure diffusion with respect to expected time to peak, but displayed less variability. However, the expected time to peak was reduced to 11 hours when an additional directionality was prescribed for the glymphatic circulation. In a model including drainage directly from the brain parenchyma, the time to peak occurred after 6-8 hours for the gray matter.
Conclusion: Even when uncertainties are taken into account, we find that diffusion alone is not sufficient to explain the transport of tracer deep into the white matter as seen in experimental data. A glymphatic velocity field may increase transport if a directional structure is included in the glymphatic circulation.