One trick is to collect k facets and compute their normals, collect k vertices, and then compute the potential for each of the k² facet-vertex pairs. If the k facets fit in some level of cache, these k² calculations will go much faster. Each vertex needs to accumulate a potential, and these k potentials can be kept in cache as well. The cache must hold the centroid and normal for each facet (6 floats) and the location and accumulating potential for each vertex (4 floats), about 10k floats in all. (This is reminiscent of IBM 704 tape logic.)
The “cache savvy” alternative in this code uses this idea.