[version 2; peer review: 2 approved]
This revised version improves the exposition in many places, thanks to helpful feedback from two peer reviewers, and adds detailed supplementary material on how empirical privacy loss (EPL) functions in the case of three simplified examples, including (1) when EPL is very close to epsilon; (2) when EPL is substantially less than epsilon, due to slack in the inequality in the Sequential Composition Theorem; and (3) when EPL is substantially more than epsilon, due to invariants.
DP – differentially private
E2E – end-to-end
TC – total count
SC – stratified count
MAE – median absolute error
EPL – empirical privacy loss
In the United States, the Decennial Census is an important part of democratic governance. Every ten years, the US Census Bureau is constitutionally required to count the “whole number of persons in each State,” and in 2020 this effort is likely to cost over 15 billion dollars ^{ 1, 2 }. The results will be used for apportioning representation in the US House of Representatives and dividing federal tax dollars between states, as well as for a multitude of other governmental activities at the national, state, and local levels. Data from the decennial census will also be used extensively by sociologists, economists, demographers, and other researchers, and it will also inform strategic decisions in the private and nonprofit sectors, and facilitate the accurate weighting of subsequent population surveys for the next decade ^{ 3 }.
The confidentiality of information in the decennial census is also required by law, and the 2020 US Census will use a novel approach to “disclosure avoidance” to protect respondents’ data ^{ 4 }. This approach builds on Differential Privacy, a mathematical definition of privacy that has been developed over the last decade and a half in the theoretical computer science and cryptography communities ^{ 5 }. Although the new approach allows a more precise accounting of the variation introduced by the process, it also risks reducing the utility of census data—it may produce counts that are substantially less accurate than the previous disclosure avoidance system, which was based on redacting the values of table cells below a certain size (cell suppression) and a technique called swapping, where pairs of households with similar structures but different locations had their location information exchanged in a way that required that the details of the swapping procedure be kept secret ^{ 6 }.
To date, there is a lack of empirical examination of the new disclosure avoidance system, but the approach was applied to the 2018 end-to-end (E2E) test of the decennial census, and computer code used for this test, as well as accompanying exposition, has recently been released publicly by the Census Bureau ^{ 4, 7 }.
We used the recently released code, preprints, and data files to understand and quantify the error introduced by the E2E disclosure avoidance system when the Census Bureau applied it to 1940 census data (for which the individual-level data has previously been released ^{ 8 }) for a range of privacy loss budgets. We also developed an empirical measure of privacy loss and used it to compare the error and privacy of the new approach to that of a (non-differentially private) simple-random-sampling approach to protecting privacy.
A randomized algorithm for analyzing a database is differentially private (DP) if withholding or changing one person’s data does not substantially change the algorithm’s output. If the results of the computation are roughly the same whether or not my data are included in the database, then the computation must be protecting my privacy. DP algorithms come with a parameter, ε (epsilon), that quantifies the worst-case privacy loss: the smaller the value of ε, the stronger the privacy guarantee.
To be precise, a randomized algorithm A is ε-differentially private if, for every pair of databases x and x′ that differ by one individual’s data, and for every set of possible outputs C, Pr[A(x) ∈ C] ≤ exp(ε) · Pr[A(x′) ∈ C].
Differential privacy is a characteristic of an algorithm; it is not a specific algorithm. Algorithms often achieve differential privacy by adding random variation ^{ 5 }.
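To make this concrete, here is a minimal sketch of one such algorithm: an ε-DP release of a count query (sensitivity 1) using two-sided geometric noise, the distribution used in the E2E test. This is our own illustration, not the Bureau's production code, and the function names are ours.

```python
import math
import random

def two_sided_geometric(epsilon, rng=random):
    """Sample noise from the two-sided geometric ("discrete Laplace")
    distribution with alpha = exp(-epsilon), i.e. the difference of two
    independent geometric random variables."""
    alpha = math.exp(-epsilon)

    def geometric():
        # Number of failures before the first success, success prob 1 - alpha.
        u = 1.0 - rng.random()  # u in (0, 1], avoids log(0)
        return int(math.floor(math.log(u) / math.log(alpha)))

    return geometric() - geometric()

def dp_count(true_count, epsilon, rng=random):
    """An epsilon-DP release of a count query with sensitivity 1."""
    return true_count + two_sided_geometric(epsilon, rng)
```

The noise is symmetric around zero, so the released count is an unbiased estimate of the true count, and its variance depends only on ε, not on the size of the count.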
The new disclosure avoidance system for the 2020 US Census is designed to be DP and to maintain the accuracy of census counts. To complicate things beyond the typical challenge faced in DP algorithm design, there are certain counts in the census that will be published precisely as enumerated, without any variation added. These invariants have not yet been selected for the 2020 decennial census, but in the 2018 end-to-end (E2E) test, the total count for each state and the number of households in each enumeration district were invariants. There are also inequalities that will be enforced. The E2E test required the total count of people in an enumeration district to be greater than or equal to the number of occupied households in that district ^{ 9 }.
At a high level, the census approach to this challenge repeats two steps for multiple levels of a geographic hierarchy (from the top down, hence their name “TopDown”). The first step (Imprecise Histogram) adds variation from a carefully chosen distribution to the stratified counts of individuals. This produces a set of counts with illogical inconsistencies, which we refer to as an “imprecise histogram”. For example, counts in the imprecise histogram might be negative, might violate invariants or other inequalities, or might be inconsistent with the counts that are one level up in the geographic hierarchy. The second step (Optimize) finds optimized counts for each most-detailed cell in the histogram, using constrained convex optimization to make them as close as possible to the counts in the imprecise histogram, subject to the constraints that the optimized counts be nonnegative, consistent with each other and the higher levels of the hierarchy, and satisfy the invariants and inequalities. These two steps are performed for each geographic level, from the coarsest to the finest. Each level is assigned a privacy budget, and by the Sequential Composition Theorem the total privacy loss is bounded by the sum of the per-level budgets.
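As a miniature illustration of the Optimize step (a deliberately simplified sketch of our own, not the Bureau's implementation, which optimizes many constraints jointly), the following projects a vector of imprecise child counts onto the set of nonnegative counts that sum to an invariant parent total:

```python
def project_to_invariant_total(noisy_counts, total):
    """Euclidean projection of noisy child counts onto the feasible set
    {x : x >= 0, sum(x) == total}: the classic projection onto a scaled
    simplex, a minimal stand-in for the constrained convex optimization
    TopDown performs at each geographic level."""
    u = sorted(noisy_counts, reverse=True)
    cumulative, theta = 0.0, 0.0
    for i, ui in enumerate(u, start=1):
        cumulative += ui
        t = (cumulative - total) / i
        if ui - t > 0:
            theta = t  # the last i for which this holds gives the shift
    return [max(x - theta, 0.0) for x in noisy_counts]
```

For example, noisy counts [3, −1, 2] with an invariant total of 4 project to [2.5, 0, 1.5]: the negative count is raised to zero and the remainder is redistributed so the children again sum to the parent.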
The aggregate statistics (internally called “DP queries” in the TopDown algorithm) afford a way to choose specific statistics that are more important to keep accurate, and the E2E test included two such aggregates: a household/group-quarters query, which increases the accuracy of the count of each household type at each level of the hierarchy, and a race/ethnicity/age query, which increases the accuracy of the stratified counts of people by race, ethnicity, and voting age across all household/group-quarters types (again for each level of the spatial hierarchy). It also included “detailed queries” corresponding to boxes in the histogram. The detailed queries were afforded 10% of the privacy budget at each level, while the DP queries split the remaining 90% of the privacy budget, with 22.5% spent on the household/group-quarters queries and 67.5% spent on the race/ethnicity/age queries.
The epsilon budget of the level governed how much total random variation to add. A further parameterization of the epsilon budget determined how the variance was allocated between the histogram counts and each type of aggregate statistic. We write ε_q for the portion of the level’s budget allocated to a given query (a detailed histogram count or an aggregate statistic), where the ε_q sum to the level budget. The imprecise counts and imprecise aggregate statistics are unbiased estimates with variance 2·exp(−ε_q)/(1 − exp(−ε_q))², the variance of the two-sided geometric distribution from which the added variation is drawn.
The variation added to each histogram count comes from the same distribution, and is independent of all other added variation; the variance does not scale with the magnitude of the count, e.g. adding 23 people to the count of age 18 and older non-Hispanic Whites is just as likely as adding 23 people to the count of age under 18 Hispanic Native Americans, even though the population of the latter is smaller.
We note that the approach the Census Bureau has taken with TopDown, in which imprecise histogram data are optimized for internal consistency, builds on a line of research over the last decade that has focused on obtaining count data that is DP
As described above, the privacy loss of a DP algorithm is quantified by a unitless number, ε: the smaller the value of ε, the less privacy is lost.
It is possible to empirically quantify privacy loss, which has the potential to show that the inequality of the Sequential Composition Theorem is not tight. The brute-force approach to quantifying privacy loss empirically is to search over databases
For algorithms that produce DP counts of multiple subpopulations, such as TopDown, it is possible to use the distribution of the residual difference between the precise count and the DP count to derive a proxy of the distribution produced by the brute force approach ^{ 14 }. The special structure of count queries affords a way to avoid rerunning the algorithm repeatedly, which is essential for TopDown, since it takes several hours to complete a single run of the algorithm. Assuming that the residual difference of the DP count minus the precise count is identically distributed for queries across similar areas (such as voting-age population across all enumeration districts), then, instead of focusing only on the histogram counts containing the individual who has changed, we used the residuals for all areal units to estimate the probability of the event we are after:
where error_{j} is the residual difference of the DP count returned by TopDown minus the precise count for that same quantity in the 1940 census, and the error_{j′} are the residuals for the other areal units at the same geographic level.
To measure the empirical privacy loss (EPL), we approximated the probability distribution of the residuals (DP count minus precise count at a selected level of the geographic hierarchy), which we denote p̂(x).
See Supplementary Methods Appendix for additional detail on the design and validation of the EPL metric ^{ 15 }.
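A minimal sketch of this residual-based estimator follows. It is a simplified illustration of the idea only (the exact estimator used in this work is given in the Supplementary Methods Appendix): tabulate the pooled residuals and take the largest log-ratio between adjacent residual values, since changing one person's data shifts a count, and hence its residual, by one.

```python
import math
from collections import Counter

def empirical_privacy_loss(residuals):
    """Estimate EPL from pooled residuals (DP count minus precise count):
    the maximum over x of |log(p_hat(x) / p_hat(x + 1))|, restricted to
    residual values x where both probabilities are observed."""
    counts = Counter(residuals)
    epl = 0.0
    for x, c in counts.items():
        if x + 1 in counts:
            epl = max(epl, abs(math.log(c / counts[x + 1])))
    return epl
```

For instance, if residual 0 is observed 50 times and residuals ±1 are each observed 30 times, the sketch reports EPL = log(50/30) ≈ 0.51.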
There are seven key choices in implementing TopDown that balance accuracy and privacy. We list them here and state how they were set in the 2018 end-to-end test when run on the 1940 Census data:
Overall privacy. A range of ε values was used in the test run.
How to split this budget between national, state, county, tract, block group, and block. In the test run, the privacy loss budget was split evenly across the geographic levels.
What aggregate statistics (also known as “DP Queries”) to include. In the test, two DP Queries were included: (i) counts stratified by age-group/race/ethnicity (and therefore aggregated over household/group-quarters type); and (ii) the household/group-quarters counts, which tally the total number of people living in each type of housing (in a household, in institutional facilities of certain types, in non-institutional facilities of certain types).
At each level, how to split the level budget between detailed queries and DP queries. The test run used 10% for detailed queries, 22.5% for household/group-quarters counts, and 67.5% for age-group/race/ethnicity-stratified counts.
What invariants to include. The test run held the total population count at the national and state level invariant.
What constraints to include. The test run constrained the total count of people to be greater than or equal to the total count of occupied households at each geographic level.
What to publish. The test run published a synthetic person file and synthetic household file for a range of ε values.
We calculated residuals (DP count minus precise count) and summarized their distribution by its median absolute error (MAE) for total count (TC) and age/race/ethnicity-stratified count (SC) at the state, county, and enumeration-district level. We also summarized the size of these counts from the precise-count versions to understand the relative error as well as the absolute error introduced by TopDown.
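The residual summary above can be sketched in a few lines (a minimal illustration; the function name and inputs are ours):

```python
import statistics

def median_absolute_error(dp_counts, precise_counts):
    """Median of |DP count - precise count| across the areas at one
    geographic level and stratification."""
    return statistics.median(abs(d - p) for d, p in zip(dp_counts, precise_counts))
```

Applied, for example, to per-county total counts, this yields the MAE figures reported in the Results.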
We calculated a measure of empirical privacy loss (EPL), inspired by the definition of differential privacy. To measure EPL, we approximated the probability distribution of the residuals (DP count minus precise count at a selected level of the geographic hierarchy), which we denote p̂(x).
See Supplementary Methods Appendix for additional detail on the design and validation of the EPL metric ^{ 15 }. We hypothesized that the EPL of TopDown would be substantially smaller than the theoretical guarantee of ε, due to slack in the inequality of the Sequential Composition Theorem.
We searched for bias in the residuals from (1), with the hypothesis that the DP counts are larger than precise counts in spatial areas with high homogeneity and smaller than precise counts in areas with low homogeneity. We based this hypothesis on the expected impact of the nonnegativity constraints included in the optimization steps of the TopDown algorithm. For each detailed query with a negative value for its noisy count, the optimization step will increase the value to make the results logical, and this reduction in variance must trade off against some increase in bias. To quantify the scale of the bias introduced by optimization, for each geographic area, we constructed a simple homogeneity index by counting the cells of the detailed histogram that contained a precise count of zero, and we examined the bias, defined as the mean of the DP count minus the precise count, for these areas when stratified by homogeneity index.
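A sketch of this stratification follows (the tuple structure for an area is hypothetical, chosen for illustration; the real detailed histograms have many more cells):

```python
from collections import defaultdict

def homogeneity_index(precise_histogram):
    """Number of cells in an area's detailed histogram with a precise
    count of zero."""
    return sum(1 for c in precise_histogram if c == 0)

def bias_by_homogeneity(areas):
    """areas: iterable of (precise_histogram, dp_total, precise_total)
    tuples. Returns the mean residual (DP minus precise) for each
    observed homogeneity index."""
    buckets = defaultdict(list)
    for hist, dp_total, precise_total in areas:
        buckets[homogeneity_index(hist)].append(dp_total - precise_total)
    return {h: sum(r) / len(r) for h, r in buckets.items()}
```

Plotting mean residual against homogeneity index then reveals whether highly homogeneous areas are systematically inflated, as hypothesized.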
We also compared the median absolute error and empirical privacy loss of TopDown to a simpler, but not differentially private, approach to protecting privacy: simple random sampling (i.e. sampling without replacement) for a range of sample sizes. To do this, we generated samples without replacement of the 1940 Census data for a range of sizes, and applied the same calculations from (1) and (2) to this alternatively perturbed data.
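The sampling comparison can be sketched as follows (function and predicate names are our own, chosen for illustration):

```python
import random

def srs_stratified_estimate(people, in_stratum, frac, seed=0):
    """Estimate a stratified count from a simple random sample drawn
    without replacement: tally stratum members in the sample and scale
    the tally by 1/frac."""
    rng = random.Random(seed)
    sample = rng.sample(people, round(len(people) * frac))
    return sum(1 for p in sample if in_stratum(p)) / frac
```

The scaled tally is an unbiased estimate of the stratum count, and its residuals can be fed to the same MAE and EPL calculations applied to the TopDown output.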
Recall that geographic areas are nested: enumeration districts are contained within counties, which are contained within states. We found error in total count (TC) varied as a function of total privacy loss budget. Running TopDown with
Error in stratified count (SC) varied similarly; when
Panel (a) shows the distribution of residuals (DP  Precise) for stratified counts at the enumeration district level, stratified by age, race, and ethnicity; and panel (b) shows the empirical privacy loss function,
We found that the empirical privacy loss was often substantially smaller than the privacy loss budget. For
This relationship between privacy loss budget and empirical privacy loss was similar for stratified counts (SC) at the enumeration district and county level, but for privacy loss budgets of 1.0 and less, the empirical privacy loss for SC at the enumeration district level was not as responsive to ε.
We found that the MAE and EPL of Simple Random Sampling (i.e. sampling uniformly, without replacement) varied with larger sample size in a manner analogous to the total privacy budget in TopDown, for
Error in stratified count varied similarly; for a 5% sample, we found median absolute error in SC of 18 at the enumeration district level, 19 at the county level, and 41 at the state level; a 50% sample produced median absolute error in SC of 4 at the enumeration district level, 5 at the county level, and 9 at the state level.
We found empirical privacy loss increased as sample size increased. For a 5% sample, at the enumeration district level, we found EPL of 0.020 for TC and 0.098 for SC, and at the county level, we found 0.035 for TC and 0.034 for SC; a 50% sample produced EPL of 0.079 for TC and 0.318 for SC at the enumeration district level, and 0.082 for TC and 0.150 for SC at the county level; and a 95% sample produced EPL of 0.314 for TC and 1.333 for SC at the enumeration district level, and 0.429 for TC and 0.612 for SC at the county level (
The curve with circular markers shows that in TopDown, the choice of
Privacy loss budget (ε) | Closest SRS sample
1.0 | 50%
2.0 | 75%
4.0 | 90%
6.0 | 95%
The bias introduced by TopDown varied with homogeneity index, as hypothesized. Enumeration districts with homogeneity index 0 (0 empty cells in the detailed histogram) had TC systematically lower than the precise count, while enumeration districts with homogeneity index 22 (the maximum number of empty cells observed in the detailed histogram) had TC systematically higher than the precise count. The size of this bias decreased as a function of ε.
The homogeneity index, defined as the number of cells with precise count of zero in the detailed histogram, is positively associated with the bias (markers show the mean difference between the DP count estimated by TopDown and the precise count, and shaded area shows the distribution of individual differences). This plot shows the association for enumeration districts, and a similar relationship holds at the county level. As
Counties displayed the same general pattern, but there are fewer counties and they typically have fewer empty strata, so the pattern was not as pronounced. The size of this bias again decreased as a function of ε.
We anticipate some readers of this work will be social researchers who rely on Census Bureau data for quantitative work, and who have concerns that the Census Bureau is going to reduce the accuracy of these data. Such a reader may be open to the possibility that privacy is a valid reason for reducing accuracy, yet still be concerned about how this will affect their next decade of research. Our results, visually summarized in
We also expect that some readers will be more drawn to the lower end of the epsilon curve. Just how private is TopDown with
Comparing error in total count or stratified count across levels of the geographic hierarchy reveals a powerful feature of the TopDown algorithm: the error is of similar magnitude even though the counts are substantially different in size. This is because the variation added at each level has been specified to have the same portion of the total privacy budget. It remains to be investigated how alternative allocations of privacy budget across levels will change the error and empirical privacy loss.
For
Accurate counts in small communities are important for emergency preparedness and other routine planning tasks performed by state and local government demographers, and this work may help to understand how such work will be affected by the shift to a DP disclosure avoidance system.
This work has not investigated more detailed research uses of decennial census data in social research tasks, such as segregation research, and how this may be affected by TopDown.
Another important use of decennial census data is in constructing control populations and survey weights for survey sampling of the US population for health, political, and public opinion polling. Our work provides some evidence on how TopDown may affect this application, but further work is warranted.
This work fits into the beginning of a discussion on how to best balance privacy and accuracy in decennial census data collection, and there is a need for continued discussion. This need must be balanced against a risky sort of observer bias—some researchers have hypothesized that calling attention to the privacy and confidentiality of census responses, even if done in a positive manner, could reduce the willingness of respondents to answer census questions, and ongoing investigation with surveys and cognitive testing may provide some evidence on the magnitude of this effect as well as potential countermeasures ^{ 17 }.
There are many differences between the 1940 census data and the 2020 data to be collected next year. In addition to the US population being three times larger now, the analysis will have six geographic levels instead of four, ten times more race groups, and over 60 times more age groups. We expect that this will yield detailed queries with typical precise count sizes even smaller than the stratified counts for enumeration districts we have examined here. We suspect that the impact of this will be to slightly decrease accuracy and increase privacy loss, but whether our hypothesis holds remains to be seen.
In addition to the changes in the data, additional changes are planned for TopDown, such as a switch from independent geometrically distributed variation to the High Dimensional Matrix Mechanism. We expect this to increase the accuracy a small amount without changing the empirical privacy loss.
In this work, we have focused on the median of the absolute error, but the spread of this distribution is important as well, and in future work, researchers may wish to investigate the tails of this distribution. We have also focused on the empirical privacy loss for specific queries at specific geographic aggregations, and our exploration was not comprehensive. Therefore, it is possible that some other test statistic would demonstrate a larger empirical privacy loss than we have found with our approach. Our approach also assumes that the residuals for different locations in a single run are an acceptable proxy for the residuals from the same location across multiple runs. Although these are certainly different, we suspect that the difference is sufficiently small as to not affect our estimates substantially.
The TopDown algorithm will provide a provably differentially private disclosure avoidance system for the 2020 decennial census.
Individual-level data from the 1940 US Census is available from IPUMS
These data are under Copyright of Minnesota Population Center, University of Minnesota. Access to the documentation is freely available without restriction; however, users must register before extracting data from the website.
The output of the TopDown algorithm when run on the 1940 US Census data is available to download from the US Census Bureau:
These data are under Copyright of the United States Census Bureau.
Zenodo: Extended data for Differential privacy in the 2020 US census, what will it do? Quantifying the accuracy/privacy tradeoff.
This project contains a full table of summary counts and errors for a range of levels of geographic hierarchy, stratification, and epsilon.
Zenodo: Supplementary Methods Appendix for Differential privacy in the 2020 US census, what will it do? Quantifying the accuracy/privacy tradeoff: Design and validation of Empirical Privacy Loss (EPL) metric.
This project contains additional details on the design and validation of the EPL metric used in this paper.
Extended data are available under the terms of the
Scripts to produce all results and figures in this paper are available online:
Archived scripts at time of publication:
License:
Thanks to Neil Marquez (University of Washington) for suggesting comparing TopDown to simple random sampling. Thanks to danah boyd, Cynthia Dwork, Simson Garfinkel, Philip Leclerc, and Kunal Talwar for their helpful comments and discussion of this work.
I am pleased with the authors' responses to my initial review. Adding extra data in the supplementary materials provides a fuller picture of the analyses they executed, and the clarifications added to the text enhance understanding. I also greatly appreciate the new Supplementary Methods Appendix that provides a more detailed discussion of the EPL measure.
Is the work clearly and accurately presented and does it cite the current literature?
Yes
If applicable, is the statistical analysis and its interpretation appropriate?
Partly
Are all the source data underlying the results available to ensure full reproducibility?
Yes
Is the study design appropriate and is the work technically sound?
Yes
Are the conclusions drawn adequately supported by the results?
Partly
Are sufficient details of methods and analysis provided to allow replication by others?
Yes
Reviewer Expertise:
geography, demography, census data, differential privacy
I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.
I am happy with the authors' response to my comments and with their new revised paper.
While I would have liked to see a more formal analysis and description of the method evaluated, I also understand the authors’ desire to keep the article accessible to a wider audience.
To conclude, I believe that this article could be accepted without further revision.
Is the work clearly and accurately presented and does it cite the current literature?
Partly
If applicable, is the statistical analysis and its interpretation appropriate?
Partly
Are all the source data underlying the results available to ensure full reproducibility?
Yes
Is the study design appropriate and is the work technically sound?
Partly
Are the conclusions drawn adequately supported by the results?
Partly
Are sufficient details of methods and analysis provided to allow replication by others?
Yes
Reviewer Expertise:
Artificial Intelligence, Differential Privacy, Optimization
I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.
The paper examines the behavior of TopDown, a privacypreserving algorithm proposed to release differentially private US Census data. The authors examine the privacy, accuracy, and bias tradeoff induced by the application of TopDown on the 1940 US Census dataset. The analysis was detailed for various privacy loss levels (i.e., epsilon values) and compared against a simple random sampling approach.
The authors provide a brief overview of Differential Privacy and the TopDown algorithm. Next, they introduce the empirical privacy loss as an empirical quantification of the loss of privacy induced by the application of a differentially private mechanism, and, finally, they provide an extensive evaluation on an application of TopDown on the 1940 US Census data release.
An interesting aspect of this work is the introduction of a novel evaluation metric, called "empirical privacy loss" or EPL. The authors argue that the use of the postprocessing strategy adopted by TopDown, that projects the differentially private solution into a feasible space, may reduce the theoretical privacy loss and the experimental evaluation seem to support such claim. In particular, the authors found that the EPL for a given class of counts (total count and stratified count) is smaller than the theoretical privacy loss guaranteed by the algorithm. I have several comments about this metric, reported in the detailed comments section.
I found this work original, in that it provides an extensive evaluation of the privacy, accuracy, and bias tradeoff of the TopDown algorithm. However, I also found the absence of a related work section unusual and would like to point out that there are other works that use optimization techniques to publish accurate count statistics, e.g.:
Michael Hay, Vibhor Rastogi, Gerome Miklau, Dan Suciu: Boosting the Accuracy of Differentially Private Histograms Through Consistency. PVLDB 3(1): 1021–1032 (2010) ^{ 1 }.
Chao Li, Gerome Miklau, Michael Hay, Andrew McGregor, Vibhor Rastogi: The matrix mechanism: optimizing linear counting queries under differential privacy. VLDB J. 24(6): 757–781 (2015) ^{ 2 }.
Yu-Hsuan Kuo, Cho-Chun Chiu, Daniel Kifer, Michael Hay, Ashwin Machanavajjhala: Differentially Private Hierarchical Count-of-Counts Histograms. PVLDB 11(11): 1509–1521 (2018) ^{ 3 }.
Ferdinando Fioretto, Pascal Van Hentenryck: Differential Privacy of Hierarchical Census Data: An Optimization Approach. CP 2019: 639–655 ^{ 4 }.
The paper is well organized and described with a good amount of detail. However, I would have liked to see a more formal description of the TopDown algorithm and of the empirical privacy loss concept. In particular, I believe that describing TopDown using an optimization model would greatly simplify readability and avoid some doubts, such as those I list in my detailed comments. I would also suggest the authors introduce an illustration of the hierarchy utilized by the Census, together with the amount of privacy budget used at each level. This could, for instance, be visualized as a tree, where the root node describes the total counts at the national level, its children describe counts at the state level, and so on. I believe that such an illustration will ease visualizing the process performed by TopDown during Step 2, in order to satisfy the consistency of the problem constraints.
It would also be useful to have a table summarizing the problem constraints. For example, the authors describe equalities constraints, such as those that constrain the aggregate statistics and counts as well as those that force the invariants, and inequality constraints, such as nonnegativity and properties over the group sizes.
The authors provide a helpful overview of the TopDown algorithm, which operates in two steps: Noise addition and Optimization. I believe that the description can be further improved (I found the text to be quite verbose) and would encourage the authors to supply the following information:
A table that summarizes the attributes of the histograms to be produced (e.g., counts of each geographic by age, race, ethnicity, household/group quarters) and the aggregate statistics.
An illustration highlighting the dependence between counts, and, thus, the constraints arising from these dependencies.
The authors refer to "aggregate statistics" as "DP queries". I am not sure why this terminology was selected. To the best of my knowledge, a DP query is simply a function over a dataset that happens to satisfy DP. I would suggest using different terminology for identifying private aggregates.
At the end of the third paragraph of
In
I found the introduction of the empirical privacy loss concept quite interesting. However, I also have a few reservations. First, I think that the formula in this section could be described in more detail. I may have missed something, but I could not find what C corresponds to. Also, this formula seems to be hard to compute, and I wish the authors had spent a few words on how they address this challenge.
The notation \hat{p}_k used in the formula \Pr[error … ] seems to have the same semantics as the notation \hat{p}(x), introduced in point (2) of Section “
In section
On point (1): I suggest spacing the epsilon values listed;
On point (4): I wonder if the authors have some intuitions on why the test run used more budget for aggregated statistics than for aggregated queries. I believe it would be very insightful to discuss the implications of such budget partitioning.
In section
Point (2): I would have liked it if the authors could have further elaborated on how the empirical privacy loss is computed. Is it the maximum among all x of EPL(x)?
The authors specify that the EPL is computed for the total count and they report a substantially lower loss than the theoretical privacy budget adopted. Since the privacy budget was partitioned among several levels and queries, I wonder if the authors have taken such partitioning into account when computing the final EPL score. I believe this aspect should be discussed in the text.
Have the authors validated the fidelity of the EPL score on a simple differential privacy application? For instance, I would have liked to see a brief discussion on if this metric is in agreement with the theoretical errors provided by the Laplace mechanism on counting queries (without postprocessing).
The authors explain in detail the results attained in their analysis. I found the reporting of the results at the end of each subsection to be a bit distracting. I suggest the authors introduce one or multiple tables that tabulate the results and only summarize them in the text.
Additionally, the plots in Figure 1 and the errors described in the text are for different privacy budgets: the figure illustrates the errors for epsilon = 0.5, 1.0, and 2.0, while the text describes the errors for epsilon = 0.25, 1.0, and 4.0. I suggest the authors report the results for all the epsilon values tested in a table, or make the description in the text and the figure consistent in the privacy budgets adopted.
The empirical privacy loss computed was reported for the total count at the enumeration district level and county level and compared against the privacy budget adopted by the TopDown algorithm. As stated in my comment above, I wonder if this comparison is fair. TopDown seems to partition the privacy budget among different queries, thus leaving the total count queries with substantially less budget than the original total. I encourage the authors to expand on this aspect of the evaluation.
As for the previous section, I recommend the authors use a table to tabulate the numerical results described in the last paragraph. In my opinion, it would substantially increase readability.
As for the previous section, I suggest the authors tabulate the results of the homogeneity index and bias.
Are the errors by homogeneity index an average over the sample runs?
Is the work clearly and accurately presented and does it cite the current literature?
Partly
If applicable, is the statistical analysis and its interpretation appropriate?
Partly
Are all the source data underlying the results available to ensure full reproducibility?
Yes
Is the study design appropriate and is the work technically sound?
Partly
Are the conclusions drawn adequately supported by the results?
Partly
Are sufficient details of methods and analysis provided to allow replication by others?
Yes
Reviewer Expertise:
Artificial Intelligence, Differential Privacy, Optimization
I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.
The paper examines the behavior of TopDown, a privacy-preserving algorithm proposed to release differentially private US Census data. The authors examine the privacy, accuracy, and bias tradeoff induced by the application of TopDown on the 1940 US Census dataset. The analysis was detailed for various privacy loss levels (i.e., epsilon values) and compared against a simple random sampling approach.
The authors provide a brief overview of Differential Privacy and the TopDown algorithm. Next, they introduce the empirical privacy loss as an empirical quantification of the loss of privacy induced by the application of a differentially private mechanism, and, finally, they provide an extensive evaluation on an application of TopDown on the 1940 US Census data release.
An interesting aspect of this work is the introduction of a novel evaluation metric, called "empirical privacy loss" or EPL. The authors argue that the postprocessing strategy adopted by TopDown, which projects the differentially private solution into a feasible space, may reduce the theoretical privacy loss, and the experimental evaluation seems to support this claim. In particular, the authors found that the EPL for a given class of counts (total count and stratified count) is smaller than the theoretical privacy loss guaranteed by the algorithm. I have several comments about this metric, reported in the detailed comments section.
I found this work original, in that it provides an extensive evaluation of the privacy, accuracy, and bias tradeoff of the TopDown algorithm. However, I also found the absence of a related work section unusual and would like to point out that there are other works that use optimization techniques to publish accurate count statistics, e.g.:
Michael Hay, Vibhor Rastogi, Gerome Miklau, Dan Suciu: Boosting the Accuracy of Differentially Private Histograms Through Consistency. PVLDB 3(1): 1021-1032 (2010)
Chao Li, Gerome Miklau, Michael Hay, Andrew McGregor, Vibhor Rastogi: The matrix mechanism: optimizing linear counting queries under differential privacy. VLDB J. 24(6): 757-781 (2015)
Yu-Hsuan Kuo, Cho-Chun Chiu, Daniel Kifer, Michael Hay, Ashwin Machanavajjhala: Differentially Private Hierarchical Count-of-Counts Histograms. PVLDB 11(11): 1509-1521 (2018)
Ferdinando Fioretto, Pascal Van Hentenryck: Differential Privacy of Hierarchical Census Data: An Optimization Approach. CP 2019: 639-655
The paper is well organized and described with a good amount of detail. However, I would have liked to see a more formal description of the TopDown algorithm and of the empirical privacy loss concept. In particular, I believe that describing TopDown using an optimization model would greatly simplify readability and avoid some doubts, such as those I list in my detailed comments. I would also suggest the authors introduce an illustration of the hierarchy utilized by the Census, together with the amount of privacy budget used at each level. This could, for instance, be visualized as a tree, where the root node describes the total counts at the national level, its children describe counts at the state level, and so on. I believe that such an illustration will ease visualizing the process performed by TopDown during Step 2, in order to satisfy the consistency of the problem constraints.
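For instance, the suggested tree could be prototyped as a simple nested structure. This sketch is purely illustrative (the node names follow the 1940 E2E levels, and the equal 0.25 budget share per level is the allocation quoted in the detailed comments):

```python
# Illustrative sketch of the suggested hierarchy figure: each node is a
# geographic level with the share of the privacy budget spent on queries
# at that level (equal shares per level, per the 1940 E2E allocation; the
# node structure itself is hypothetical, not taken from the paper).
hierarchy = {
    "level": "national", "budget_share": 0.25,
    "children": [{
        "level": "state", "budget_share": 0.25,
        "children": [{
            "level": "county", "budget_share": 0.25,
            "children": [{
                "level": "enumeration district", "budget_share": 0.25,
                "children": [],
            }],
        }],
    }],
}

def print_tree(node, indent=0):
    """Render the hierarchy as an indented tree, one geographic level per row."""
    print(" " * indent + f"{node['level']} (budget share {node['budget_share']})")
    for child in node["children"]:
        print_tree(child, indent + 2)

print_tree(hierarchy)
```

A figure built from such a structure would make it immediately clear that the per-level budget shares sum to the total budget.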
It would also be useful to have a table summarizing the problem constraints. For example, the authors describe equality constraints, such as those that constrain the aggregate statistics and counts as well as those that enforce the invariants, and inequality constraints, such as nonnegativity and properties over the group sizes.
The authors provide a helpful overview of the TopDown algorithm, which operates in two steps: Noise addition and Optimization. I believe that the description can be further improved (I found the text to be quite verbose) and would encourage the authors to supply the following information:
A table that summarizes the attributes of the histograms to be produced (e.g., counts for each geographic unit by age, race, ethnicity, and household/group quarters status) and the aggregate statistics.
An illustration highlighting the dependence between counts, and, thus, the constraints arising from these dependencies.
The authors refer to "aggregate statistics" as "DP queries". I am not sure why this terminology was selected. To the best of my knowledge, a DP query is simply a function over a dataset that happens to satisfy DP. I would suggest using a different terminology for identifying private aggregates.
At the end of the third paragraph of
In
I found the introduction of the empirical privacy loss concept quite interesting. However, I also have a few reservations. First, I think that the formula in this section could be described in more detail. I may have missed something, but I could not find what C corresponds to. Also, this formula seems hard to compute, and I wish the authors had spent a few words on how they address this challenge.
The notation \hat{p}_k used in the formula \Pr[error … ] seems to have the same semantics as the notation \hat{p}(x), introduced in point (2) of Section “
In section
On point (1): I suggest spacing the epsilon values listed;
On point (4): I wonder if the authors have some intuitions on why the test run used more budget for aggregated statistics than for aggregated queries. I believe it would be very insightful to discuss the implications of such budget partitioning.
In section
Point (2): I would have liked the authors to elaborate further on how the empirical privacy loss is computed. Is it the maximum among all x of EPL(x)?
The authors specify that the EPL is computed for the total count and they report a substantially lower loss than the theoretical privacy budget adopted. Since the privacy budget was partitioned among several levels and queries, I wonder if the authors have taken such partitioning into account when computing the final EPL score. I believe this aspect should be discussed in the text.
Have the authors validated the fidelity of the EPL score on a simple differential privacy application? For instance, I would have liked to see a brief discussion of whether this metric is in agreement with the theoretical errors provided by the Laplace mechanism on counting queries (without postprocessing).
eps     EPL     lower   upper
0.0010  0.0010  0.0008  0.0013
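A check of this kind could be carried out by simulating a basic eps-DP counting mechanism on two neighboring datasets and estimating the privacy loss directly from the empirical output distributions. Below is a minimal, purely illustrative sketch using the two-sided geometric mechanism (parameters and sample sizes are my own choices); for this mechanism the log-ratio of output probabilities equals eps at every output, so the estimate should recover eps up to Monte Carlo error:

```python
import numpy as np

rng = np.random.default_rng(0)

def two_sided_geometric(eps, size):
    # Difference of two geometric variables yields the symmetric
    # ("two-sided geometric") noise commonly used for eps-DP integer counts.
    p = 1.0 - np.exp(-eps)
    return rng.geometric(p, size) - rng.geometric(p, size)

eps = 0.5
n = 2_000_000
count = 100  # neighboring datasets: true counts 100 and 101

a = count + two_sided_geometric(eps, n)
b = count + 1 + two_sided_geometric(eps, n)

# Empirical output distributions over a common support; keep only
# well-populated bins so the log-ratio estimates are stable.
support = range(count - 10, count + 12)
pa = np.array([(a == k).mean() for k in support])
pb = np.array([(b == k).mean() for k in support])
mask = (pa > 1e-3) & (pb > 1e-3)

# Empirical privacy loss: worst-case log-probability ratio over outputs.
epl = np.max(np.abs(np.log(pa[mask] / pb[mask])))
print(f"theoretical eps = {eps}, estimated EPL = {epl:.3f}")
```

Without postprocessing, the estimated EPL tracks the theoretical eps closely, which is exactly the kind of baseline agreement the comment above asks the authors to demonstrate.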
The authors explain in detail the results attained in their analysis. I found the reporting of the results at the end of each subsection to be a bit distracting. I suggest the authors introduce one or multiple tables that tabulate the results and only summarize them in the text.
Additionally, the plots in Figure 1 and the errors described in the text are for different privacy budgets: the figure illustrates the errors for epsilon = 0.5, 1.0, and 2.0, while the text describes the errors for epsilon = 0.25, 1.0, and 4.0. I suggest the authors report the results for all the epsilons tested in a table, or make the description in the text and the figure consistent in the privacy budgets adopted.
The empirical privacy loss computed was reported for the total count at the enumeration district and country levels and compared against the privacy budget adopted by the TopDown algorithm. As stated in my comment above, I wonder if this comparison is fair. TopDown seems to partition the privacy budget across different queries, thus leaving the total count queries with substantially less budget than the original total budget. I encourage the authors to expand on this aspect of the evaluation.
As for the previous section, I recommend that the authors use a table to tabulate the numerical results described in the last paragraph. In my opinion, it will substantially increase readability.
As for the previous section, I suggest the authors tabulate the results of the homogeneity index and bias.
Are the errors by homogeneity index an average over the sample runs?
Using differentially private 1940 census data produced by the US Census Bureau's TopDown algorithm, Petti and Flaxman assess the privacy/accuracy tradeoff along multiple dimensions for this algorithm for multiple values of epsilon. The authors analyzed the median absolute error, empirical privacy loss, and bias for the differentially private data. They also compared the median absolute error and empirical privacy loss for differentially private data with data generated through simple random sampling. This is one of the first, if not the first, articles assessing the accuracy of decennial census data published through a differentially private algorithm.
Petti and Flaxman provide a good overview of differential privacy and the Census Bureau's TopDown algorithm, a differentially private algorithm for producing decennial census data. They then compare the differentially private 1940 data with the original complete-count 1940 data to assess the error introduced by the TopDown algorithm. They find that error increased as the total privacy loss budget decreased. They also find that the empirical privacy loss was smaller than the total privacy loss budget. They measure bias introduced by the algorithm and find that bias increases as homogeneity decreases and that bias increases as the total privacy loss budget decreases. They conclude that privacy loss does not vary much for epsilon < 1.0, and that the accuracy achieved when using a 50% simple random sample is equivalent to an epsilon of 1.0.
I am intrigued by the empirical privacy loss measure introduced by Petti and Flaxman. Its formula and interpretation mirror the formula for epsilon-differential privacy. However, I would like to see a more thorough discussion of the empirical privacy loss summary statistic reported in the results section of the paper. The authors compare an empirical privacy loss summary statistic with the total privacy loss budget on pages 6 and 7 of the paper, but they never explain how the summary statistic was computed. Having that explanation would help me better understand the comparison they make throughout the paper.
The authors compare the empirical privacy loss for a given geographic unit-type of count (total count, stratified count) combination with the overall privacy loss budget. The empirical privacy loss for a given combination is less than the overall privacy loss budget. I wonder if this is the correct comparison to make. The privacy loss budget controls the overall amount of privacy leaked by the publication of all statistics. It is the sum, via sequential composition, of the epsilon fractions assigned to each geographic level-statistic combination. Thus, by definition, the empirical privacy loss associated with a particular geographic level-statistic (e.g., total population count) must be less than the privacy loss budget. I would like to see a fuller discussion of this comparison in the paper. See detailed comment #14 for more details.
I would also like additional supplemental datasets (or tables in the paper) with the empirical privacy loss summary statistics for all values of epsilon. The authors report a few values in the text and figures, but having a complete set would allow for a more comprehensive understanding of the relationship between empirical privacy loss and epsilon.
Finally, I strongly recommend that the authors use the same examples in their text as they use in the figures. The text uses epsilons of 0.25, 1.0 and 4.0 and the figures use epsilons of 0.50, 1.0, and 2.0. Making the epsilons consistent between the text and figures will help the reader better understand the analysis.
The authors' high-level overview (first paragraph in subsection entitled "TopDown algorithm") describes the noise injection (Imprecise Histogram) and optimization steps in the TopDown algorithm. They state that the "second step (Optimize) adjusts the histogram to be close as possible to the imprecise counts". I am uncertain about what histogram the authors refer to in this sentence. Is the histogram based on the original data, or is this the noise-injected detailed histogram? My understanding of the algorithm is that it generates histograms (one for each combination of geographic level and query) from the original data and then injects noise into the histograms using the appropriate two-sided geometric distribution. It then passes these noise-injected histograms to the optimization function.
I would like the authors to be more precise in their description of the histogram and the "imprecise counts" in this section.
The authors state that the 2020 US Census will have six geographic levels nested hierarchically (last sentence of TopDown algorithm paragraph). The Census Bureau allocated privacy loss budget to seven nested geographies (nation, state, county, tract group, census tract, block group, block) for the 2010 demonstration product. The Bureau has not committed to this allocation for 2020 and could still change the allocation strategy. I recommend clarifying that statement to pertain solely to the 2010 demonstration data product.
In the final clause of the last sentence of the TopDown algorithm paragraph, the authors state that "in the 1940 E2E test, only national, state, county, and district levels were included." I recommend adding the word "enumeration" before "district" in that clause.
At the end of the first paragraph in this section, the authors describe the "ethnicity-age" aggregate statistic set. The implication of this sentence is that the "ethnicity-age" aggregate statistic set was one preselected by Census for noise injection. Census did not choose this aggregate statistic set. The aggregate statistic sets chosen by Census were Voting age by Hispanic origin by Race (a 2 x 2 x 6 cell query) and Household/Group quarter (a 6 cell query). I recommend modifying this sentence to describe one of the two preselected aggregate statistic sets.
At the end of the second paragraph, the authors write that "22.5% spent on the group-quarters queries". I recommend changing the fragment to be "22.5% spent on the household/group-quarters queries". The word "household" is important when discussing this DP query. People can live either in a household or in group quarters, and by definition, households are not group quarters.
For option 3, I recommend modifying "(and therefore aggregated over group quarters types)" to be "(and therefore aggregated over household/group quarters types)". A household is not a type of group quarter.
Also in option 3, I recommend modifying "(ii) the group-quarters counts" to be "(ii) the household/group-quarters counts".
In option 5, add the word "population" between "total" and "count" in the second sentence. Otherwise, readers will not necessarily know to which total count the authors are referring.
At the end of the first paragraph of this subsection, the authors list the median and 95th percentile of TC for EDs, counties, and states. I think it is important to clarify that these counts are based on the original 1940 census data and not on any of the differentially private 1940 datasets. Since this sentence comes at the end of a paragraph describing median absolute error, readers may assume the medians and 95th percentiles are from a DP dataset. Consider moving that sentence up to the start of the paragraph.
At the end of the second paragraph of this subsection, the authors list the median and 95th percentile of SC for EDs, counties, and states. I think it is important to clarify that these counts are based on the original 1940 census data and not on any of the differentially private 1940 datasets. Since this sentence comes at the end of a paragraph describing median absolute error, readers may assume the medians and 95th percentiles are from a DP dataset. Consider moving that sentence up to the start of the paragraph.
The final two paragraphs of this subsection describe the empirical privacy loss for TC and SC for different geographic levels and different epsilons. They describe the EPL for epsilons of 0.25, 1.0, and 4.0 in the text. I would like to have a table, either in the paper or in the extended data product, that lists the EPLs for all values of epsilon and all geographic levels for TC and SC. I wonder how linear the relationship between EPL and epsilon is.
The authors list a number of EPL values in the final two paragraphs and in the right-hand panel of Figure 1, but I do not know what the EPL value represents. Is it the absolute value of the maximum observed EPL, or is it the range from the maximum to minimum observed EPL value? I would appreciate a more complete discussion of how the authors calculated the value of EPL they plot in Figure 1 and list in the text. The formula on page 5 describes how to compute EPL for a single geographic unit and value of epsilon, but I don't see how that formula extends to the summary statistics reported on page 6.
Figure 1 plots the error and EPL for epsilon equal to 0.5, 1.0, and 2.0, but the text in the final two paragraphs describes EPL for epsilons of 0.25, 1.0, and 4.0. I strongly recommend making the values in the text and the plot consistent with one another. That consistency will make it easier to interpret the plot in Figure 1.
The authors compare the empirical privacy loss for a given geographic unit-type of count (total count, stratified count) combination with the overall privacy loss budget. The empirical privacy loss for a given combination is less than the overall privacy loss budget. I wonder if this is the correct comparison to make. The privacy loss budget controls the overall amount of privacy leaked by the publication of all statistics. It is the sum, via sequential composition, of the epsilon fractions assigned to each geographic level-statistic combination. Thus, by definition, the empirical privacy loss associated with a particular geographic level-statistic (e.g., total population count) must be less than the privacy loss budget.
For a given value of epsilon, we can compute the portion of that value that is assigned to each geographic level-query combination. For example, an epsilon of 0.25 is divided up as follows:
Geographic levels = 0.25 to each level
Tables = 0.1 (detailed), 0.225 (household/group quarters), 0.675 (voting age x Hispanic x race)
We can multiply the geographic level fraction by the table fractions by epsilon to yield:
Geog level - detailed query = 0.00625 epsilon
Geog level - household/group quarters query = 0.0140625 epsilon
Geog level - voting age x Hispanic x race query = 0.0421875 epsilon
These epsilons still do not equate to an epsilon associated with a particular statistic, such as total population count. Given the optimization step and the state-level total population invariant, I'm not sure if we can compute an epsilon value for a particular statistic. But these epsilon values seem like a more appropriate comparison to the empirical privacy loss reported by the authors.
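The division described above is straightforward to reproduce; a short, purely illustrative computation (assuming a total budget of epsilon = 0.25, an equal 0.25 share per geographic level, and the table fractions quoted above):

```python
# Check of the budget-splitting arithmetic above: each geographic level
# gets 0.25 of the budget, and within a level the three query sets get
# the listed fractions. A total budget of epsilon = 0.25 is assumed.
total_epsilon = 0.25
level_fraction = 0.25  # equal share for each geographic level
table_fractions = {
    "detailed": 0.100,
    "household/group quarters": 0.225,
    "voting age x Hispanic x race": 0.675,
}

per_query_epsilon = {
    query: level_fraction * fraction * total_epsilon
    for query, fraction in table_fractions.items()
}

for query, eps in per_query_epsilon.items():
    print(f"{query}: {eps}")
```

The printed values match the per-level, per-query epsilons listed above, and the table fractions sum to one, as sequential composition requires.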
I would like to have a table of MAE and EPL values for Simple Random Sampling. Consider adding those values to the Extended Data product currently available, or adding another Extended Data product with these values.
Consider adding a plot of EPL by sample size to supplement or even replace the final paragraph of this subsection. There are a lot of numbers in the final paragraph, and I find it difficult to visualize the relationship between EPL and sampling fraction just by reading the numbers.
The x-axis for Figure 2 depicts values of Empirical Privacy Loss, but neither the text nor the caption describes how the values were computed. This comment fits with comment 12: what does the Empirical Privacy Loss summary statistic mean and how was it computed?
Figure 3 plots the error and EPL for epsilon equal to 0.5, 1.0, and 2.0, but the text in the first paragraph describes EPL by homogeneity index for epsilons of 0.25, 1.0, and 4.0. I strongly recommend making the values in the text and the plot consistent with one another. That consistency will make it easier to interpret the plot in Figure 3.
I recommend moving the (Figure 3) parenthetical to the end of the discussion on EPL by homogeneity for enumeration districts. Figure 3 only shows the results for enumeration districts, but the parenthetical comes after the discussion for counties.
In the paragraph and Figure 3, the authors list a summary statistic for bias by homogeneity index and epsilon. Is the summary statistic the mean or the median?
Figure 3 displays the violin plot/mean bias for 11 of 23 homogeneity index values. I recommend modifying the figure caption to indicate that the authors are only displaying some of the homogeneity index values on the plot.
I also recommend modifying the x-axis label to indicate that the homogeneity index values are for enumeration districts. That would help readers immediately understand what geographic units are being plotted.
Is the work clearly and accurately presented and does it cite the current literature?
Yes
If applicable, is the statistical analysis and its interpretation appropriate?
Partly
Are all the source data underlying the results available to ensure full reproducibility?
Yes
Is the study design appropriate and is the work technically sound?
Yes
Are the conclusions drawn adequately supported by the results?
Partly
Are sufficient details of methods and analysis provided to allow replication by others?
Yes
Reviewer Expertise:
geography, demography, census data, differential privacy
I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.
Using differentially private 1940 census data produced by the US Census Bureau's TopDown algorithm, Petti and Flaxman assess the privacy/accuracy tradeoff along multiple dimensions for this algorithm for multiple values of epsilon. The authors analyzed the median absolute error, empirical privacy loss, and bias for the differentially private data. They also compared the median absolute error and empirical privacy loss for differentially private data with data generated through simple random sampling. This is one of the first, if not the first, article assessing the accuracy of decennial census data published through a differentially private algorithm.
Petti and Flaxman provide a good overview of differential privacy and the Census Bureau's TopDown algorithm  a differentially private algorithm for producing decennial census data. They then compare the differentially private 1940 data with the original completecount 1940 data to assess the accuracy introduce by the TopDown algorithm. They find that error increased as the total privacy loss budget decreased. They also find that empirical privacy loss was smaller than total privacy loss budget. They measure bias introduced by the algorithm and find that bias increases as homogeneity decreases and that bias increases as total privacy loss budget decreases. They conclude that privacy loss does not vary much for epsilon < 1.0, and that the accuracy achieved when using a 50% simple random sample is equivalent to an epsilon of 1.0.
I am intrigued by the empirical privacy loss measure introduced by Petti and Flaxman. Its formula and interpretation mirrors the formula for epsilondifferential privacy. However, I would like to see a more thorough discussion of empirical privacy loss summary statistic reported in the results section of the paper. The authors compare an empirical privacy loss summary statistic with total privacy loss budget on pages 6 and 7 of the paper, but they never explain how the summary statistic was computed. Having that explanation would help me better understand the comparison they make throughout the paper.
The authors compare the empirical privacy loss for a given geographic unittype of count (total count, stratified count) combination with the overall privacy loss budget. The empirical privacy loss for a given combination is less than the overall privacy loss budget. I wonder if this is the correct comparison to make. The privacy loss budget controls the overall amount of privacy leaked by the publication of all statistics. It is the sum, via sequential composition, of the epsilon fractions assigned to each geographic levelstatistic combination. Thus, by definition, the empirical privacy loss associated with a particular geographic levelstatistic (e.g., total population count) must be less than the privacy loss budget. I would like to see a fuller discussion of this comparison in the paper. See detailed comment #14 for more details.
I would also additional supplemental datasets (or tables in the paper) with the empirical privacy loss summary statistics for all values of epsilon. The authors report a few values in the text and figures, but having a complete set would allow for a more comprehensive understanding of the relationship between empirical privacy loss and epsilon.
Finally, I strongly recommend that the authors use the same examples in their text as they use in the figures. The text uses epsilons of 0.25, 1.0 and 4.0 and the figures use epsilons of 0.50, 1.0, and 2.0. Making the epsilons consistent between the text and figures will help the reader better understand the analysis.
The authors' high level overview (first paragraph in subsection entitled "TopDown algorithm") describe the noise injection (Imprecise Histogram) and optimization steps in the TopDown algorithm. They state that the "second step (Optimize) adjusts the histogram to be close as possible to the imprecise counts". I am uncertain about what histogram the authors refer to in this sentence. Is the histogram based on the original data, or is this the noiseinjected detailed histogram? My understanding of the algorithm is that is generates histograms (one for each combination of geographic level and query) from the original data and then injects noise into histograms using the appropriate twosided geometric distribution. It then passes these noiseinjected histograms to the optimization function.
I would like the authors to be more precise in their description of the histogram and the "imprecise counts" in this section.
The authors state that the 2020 US Census will have six geographic levels nested hierarchically (last sentence of TopDown algorithm paragraph). The Census Bureau allocated privacy loss budget to seven nested geographies (nation, state, county, tract group, census tract, block group, block) for the 2010 demonstration product. The Bureau has not committed to this allocation for 2020 and could still change the allocation strategy. I recommend clarifying that statement to pertain solely to the 2010 demonstration data product.
In the final clause of the last sentence of the TopDown algorithm paragraph, the authors state that "in the 1940 E2E test, only national, state, county, and district levels were included." I recommend adding the word "enumeration" before district in that clause.
At the end of first paragraph in this section, the authors describe the "ethnicityage" aggregate statistic set. The implication of this sentence is that the "ethnicityage" aggregate statistics set was one preselected by Census for noise injection. Census did not choose this aggregate statistic set. The aggregate statistic sets chosen by census were Voting age by Hispanic origin by Race (a 2 x 2 x 6 cell query) and Household/Group quarter (a 6 cell query). I recommend modifying this sentence to describe one of the two preselected aggregate statistic sets.
At the end of the second paragraph, the authors write that "22.5% spent on the groupquarters queries". I recommend changing the fragment to be "22.5% spent on the household/groupquarters queries". The word "household" is important when discussing this DP query. People can either live in household or group quarters, and by definition, households are not group quarters.
For option 3, I recommend modifying the "(and therefore aggregated over "group quarters types)" to be "(there therefore aggregated over "household/group quarters types)". A household is not a type of group quarter.
Also in option 3, I recommend modifying the "(ii) the groupquarters counts" to be "(ii) the household/group quarters counts".
In option 5, add the word "population" between "total" and "count" in the second sentence. Otherwise, readers will not necessarily know which total count to which the authors are referring.
At the end of first paragraph of this subsection, the authors list the median and 95th percentile of TC for EDs, counties, and states. I think it is important to clarify that these counts are based on the original 1940 census data and not on any of the differentially private 1940 datasets. Since this sentence comes at the end of a paragraph describing median absolute error, readers may assume the medians and 95th percentiles are from a DP dataset. Consider moving that sentence up the start of the paragraph.
At the end of second paragraph of this subsection, the authors list the median and 95th percentile of SC for EDs, counties, and states. I think it is important to clarify that these counts are based on the original 1940 census data and not on any of the differentially private 1940 datasets. Since this sentence comes at the end of a paragraph describing median absolute error, readers may assume the medians and 95th percentiles are from a DP dataset. Consider moving that sentence up the start of the paragraph.
The final two paragraphs of this subsection describe the empirical privacy loss for TC and SC for different geographic levels and different epsilons. They describe the EPL for epsilons of 0.25, 1.0, and 4.0 in the text. I would like to have a table, either in the paper or in the extended data product, that lists the EPLs for all values of epsilon and all geographic levels for TC and SC. I wonder how linear the relationship between EPL and epsilon is.
The authors list a number of EPL values in the final two paragraphs and in the righthand panel of Figure 1, but I do not know what the EPL value represents. Is it the absolute value of the maximum observed EPL, or is it the range from the maximum to minimum observed EPL value? I would appreciate a more complete discussion of how the authors calculated the value of EPL they plot in Figure 1 and list in the text. The formula on page 5 describes how to compute EPL for a single geographic unit and value of epsilon, but I don't see how that formula extends to the summary statistics reported on page 6.
Figure 1 plots the error and EPL for epsilon equal to 0.5, 1.0, and 2.0, but the text in the final two paragraphs describes EPL for epsilons of 0.25, 1.0, and 4.0. I strongly recommend making the values in the text and the plot consistent with one another. That consistency will make it easier to interpret the plot in Figure 1.
The authors compare the empirical privacy loss for a given geographic unittype of count (total count, stratified count) combination with the overall privacy loss budget. They empirical privacy loss for a given combination is less than the overall privacy loss budget. I wonder if this is the correct comparison to make. The privacy loss budget controls the overall amount of privacy leaked by the publication of all statistics. It is the sum, via sequential composition, of the epsilon fractions assigned to each geographic levelstatistic combination. Thus, by definition, the empirical privacy loss associated with a particular geographic levelstatistic (e.g., total population count) must be less than the privacy loss budget.
For a given value of epsilon, we can compute the portion of that value that is assigned to each geographic level – query combination. For example, an epsilon of 0.25 is divided up as follows:
Geographic levels = 0.25 to each level
Tables = 0.1 (detailed), 0.225 (household / group quarters), 0.675 (voting age × Hispanic × race)
We can multiply the geographic-level fraction by the table fractions and by epsilon to yield:
Geographic level – detailed query = 0.25 × 0.1 × 0.25 = 0.00625
Geographic level – household / group quarters query = 0.25 × 0.225 × 0.25 = 0.0140625
Geographic level – voting age × Hispanic × race query = 0.25 × 0.675 × 0.25 = 0.0421875
These epsilons still do not equate to an epsilon associated with a particular statistic, such as total population count. Given the optimization step and the state-level total population invariant, I'm not sure if we can compute an epsilon value for a particular statistic. But these epsilon values seem like a more appropriate comparison to the empirical privacy loss reported by the authors.
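The allocation arithmetic above is easy to reproduce. The following sketch assumes the geographic-level and table fractions quoted from the specification, with the query names used purely as labels:

```python
# Per-query epsilon allocation, assuming a total budget of epsilon = 0.25,
# a geographic-level fraction of 0.25, and the table fractions quoted above.
epsilon = 0.25
geo_fraction = 0.25
table_fractions = {
    "detailed": 0.1,
    "household / group quarters": 0.225,
    "voting age x Hispanic x race": 0.675,
}

# Per-query epsilon = total epsilon * geographic fraction * table fraction.
allocations = {name: epsilon * geo_fraction * frac
               for name, frac in table_fractions.items()}

for name, eps in allocations.items():
    print(f"{name}: {eps}")
# detailed: 0.00625
# household / group quarters: 0.0140625
# voting age x Hispanic x race: 0.0421875
```

Note that the three per-query values sum to 0.0625, i.e., one geographic level's quarter-share of the total budget of 0.25.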
I would like to have a table of MAE and EPL values for Simple Random Sampling. Consider adding those values to the Extended Data product currently available, or adding another Extended Data product with these values.
Consider adding a plot of EPL by sample size to supplement or even replace the final paragraph of this subsection. There are a lot of numbers in the final paragraph, and I find it difficult to visualize the relationship between EPL and sampling fraction just by reading the numbers.
The x-axis for Figure 2 depicts values of Empirical Privacy Loss, but neither the text nor the caption describes how the values were computed. This comment fits with comment 12: what does the Empirical Privacy Loss summary statistic mean, and how was it computed?
Figure 3 plots the error and EPL for epsilon equal to 0.5, 1.0, and 2.0, but the text in the first paragraph describes EPL by homogeneity index for epsilons of 0.25, 1.0, and 4.0. I strongly recommend making the values in the text and the plot consistent with one another. That consistency will make it easier to interpret the plot in Figure 3.
I recommend moving the (Figure 3) parenthetical to the end of the discussion on EPL by homogeneity for enumeration districts. Figure 3 only shows the results for enumeration districts, but the parenthetical comes after the discussion for counties.
In the paragraph and Figure 3, the authors list a summary statistic for bias by homogeneity index and epsilon. Is the summary statistic the mean or the median?
Figure 3 displays the violin plot/mean bias for 11 of 23 homogeneity index values. I recommend modifying the figure caption to indicate that the authors are only displaying some of the homogeneity index values on the plot.
I also recommend modifying the x-axis label to indicate that the homogeneity index values are for enumeration districts. That would help readers immediately understand what geographic units are being plotted.
Is the work clearly and accurately presented and does it cite the current literature?
Is the study design appropriate and is the work technically sound?
Are sufficient details of methods and analysis provided to allow replication by others?
If applicable, is the statistical analysis and its interpretation appropriate?
Are all the source data underlying the results available to ensure full reproducibility?
Are the conclusions drawn adequately supported by the results?