Publications (10 of 81)
Hössjer, O., Díaz-Pachón, D. A., Zhao, C. & Rao, J. S. (2024). An Information Theoretic Approach to Prevalence Estimation and Missing Data. IEEE Transactions on Information Theory, 70(5), 3567-3582
2024 (English). In: IEEE Transactions on Information Theory, ISSN 0018-9448, E-ISSN 1557-9654, Vol. 70, no 5, p. 3567-3582. Article in journal (Refereed). Published
Abstract [en]

Many data sources, including tracking social behavior to election polling to testing studies for understanding disease spread, are subject to sampling bias whose implications are not fully yet understood. In this paper we study estimation of a given feature (such as disease, or behavior at social media platforms) from biased samples, treating non-respondent individuals as missing data. Prevalence of the feature among sampled individuals has an upward bias under the assumption of individuals’ willingness to be sampled. This can be viewed as a regression model with symptoms as covariates and the feature as outcome. It is assumed that the outcome is unknown at the time of sampling, and therefore the missingness mechanism only depends on the covariates. We show that data, in spite of this, is missing at random only when the sizes of symptom classes in the population are known; otherwise data is missing not at random. With an information theoretic viewpoint, we show that sampling bias corresponds to external information due to individuals in the population knowing their covariates, and we quantify this external information by active information. The reduction in prevalence, when sampling bias is adjusted for, similarly translates into active information due to bias correction, with opposite sign to active information due to testing bias. We develop unified results that show that prevalence and active information estimates are asymptotically normal under all missing data mechanisms, when testing errors are absent and present respectively. The asymptotic behavior of the estimators is illustrated through simulations.

Keywords
Active information, asymptotic normality, biased estimate, missing data, testing errors
National Category
Probability Theory and Statistics
Identifiers
urn:nbn:se:su:diva-231614 (URN); 10.1109/TIT.2023.3327399 (DOI); 001217153500037 (); 2-s2.0-85176319701 (Scopus ID)
Available from: 2024-08-07; Created: 2024-08-07; Last updated: 2024-08-07; Bibliographically approved
Karlsson, M. & Hössjer, O. (2024). Classification under partial reject options. Journal of Classification, 41(1), 2-37
2024 (English). In: Journal of Classification, ISSN 0176-4268, E-ISSN 1432-1343, Vol. 41, no 1, p. 2-37. Article in journal (Refereed). Published
Abstract [en]

In many applications there is ambiguity about which (if any) of a finite number N of hypotheses that best fits an observation. It is of interest then to possibly output a whole set of categories, that is, a scenario where the size of the classified set of categories ranges from 0 to N. Empty sets correspond to an outlier, sets of size 1 represent a firm decision that singles out one hypothesis, sets of size N correspond to a rejection to classify, whereas sets of sizes 2,…,N−1 represent a partial rejection to classify, where some hypotheses are excluded from further analysis. In this paper, we review and unify several proposed methods of Bayesian set-valued classification, where the objective is to find the optimal Bayesian classifier that maximizes the expected reward. We study a large class of reward functions with rewards for sets that include the true category, whereas additive or multiplicative penalties are incurred for sets depending on their size. For models with one homogeneous block of hypotheses, we provide general expressions for the accompanying Bayesian classifier, several of which extend previous results in the literature. Then, we derive novel results for the more general setting when hypotheses are partitioned into blocks, where ambiguity within and between blocks are of different severity. We also discuss how well-known methods of classification, such as conformal prediction, indifference zones, and hierarchical classification, fit into our framework. Finally, set-valued classification is illustrated using an ornithological data set, with taxa partitioned into blocks and parameters estimated using MCMC. The associated reward function’s tuning parameters are chosen through cross-validation.

Keywords
Blockwise cross-validation, Bayesian classification, Conformal prediction, Classes of hypotheses, Indifference zones, Markov Chain Monte Carlo, Reward functions with set-valued inputs, Set-valued classifiers
National Category
Probability Theory and Statistics
Research subject
Mathematical Statistics
Identifiers
urn:nbn:se:su:diva-203754 (URN); 10.1007/s00357-023-09455-x (DOI); 001113203500001 (); 2-s2.0-85178310510 (Scopus ID)
Note

 J Classif 41, 38 (2024). DOI: 10.1007/s00357-023-09459-7

Available from: 2022-04-21; Created: 2022-04-21; Last updated: 2024-10-21; Bibliographically approved
Díaz-Pachón, D. A., Hössjer, O. & Mathew, C. (2024). Is It Possible to Know Cosmological Fine-tuning? Astrophysical Journal Supplement Series, 271(2), Article ID 56.
2024 (English). In: Astrophysical Journal Supplement Series, ISSN 0067-0049, E-ISSN 1538-4365, Vol. 271, no 2, article id 56. Article in journal (Refereed). Published
Abstract [en]

Fine-tuning studies whether some physical parameters, or relevant ratios between them, are located within so-called life-permitting intervals of small probability outside of which carbon-based life would not be possible. Recent developments have found estimates of these probabilities that circumvent previous concerns of measurability and selection bias. However, the question remains whether fine-tuning can indeed be known. Using a mathematization of the concepts of learning and knowledge acquisition, we argue that most examples that have been touted as fine-tuned cannot be formally assessed as such. Nevertheless, fine-tuning can be known when the physical parameter is seen as a random variable and it is supported in the nonnegative real line, provided the size of the life-permitting interval is small in relation to the observed value of the parameter.

Keywords
Anthropic principle, Bayesian statistics, Analytical mathematics, Cosmological constant, Cosmological parameters
National Category
Astronomy, Astrophysics and Cosmology
Identifiers
urn:nbn:se:su:diva-228713 (URN); 10.3847/1538-4365/ad2c88 (DOI); 001197225000001 (); 2-s2.0-85189897895 (Scopus ID)
Available from: 2024-04-25; Created: 2024-04-25; Last updated: 2024-04-25; Bibliographically approved
Thorvaldsen, S., Øhrstrøm, P. & Hössjer, O. (2024). The representation, quantification, and nature of genetic information. Synthese, 204(1), Article ID 15.
2024 (English). In: Synthese, ISSN 0039-7857, E-ISSN 1573-0964, Vol. 204, no 1, article id 15. Article in journal (Refereed). Published
Abstract [en]

Current genetics studies often refer to notions from information science. The purpose of this paper is to summarize and structure the different notions of information used in biology, as a step towards developing a taxonomy of information. Within this framework we propose an extension of Floridi’s conceptual model of information. We also make use of the concept of specified information and show that functional information and many other notions of information are either special cases of, or are closely related to, specified information. Since functionality of the proteins that genes code serves as an external and independent specification, this makes it possible to define genetic information in a way that includes semantic aspects. In particular, we discuss how to understand the qualitative aspects of genetic information, how to measure its quantitative aspects, and how variants of Shannon’s information measure can be applied to molecular sequence data of protein families. While a mathematical framework may not be able to incorporate all that is included within biological information, some aspects of it allow for statistical modelling. This is especially true if we restrict our focus on the discipline of genetics. The concept of genetic information is still disputed because it attributes semantic traits to what seems to be regular biochemical entities. Some researchers maintain that the use of information in biology is just metaphorical and may even be misleading. We argue that the foundation of the metaphorical view is relatively weak given the current findings in bioinformatics and show that the present understanding of genetics fits well into the context of the modern philosophy of information. The paper concludes that informational concepts have robust scientific applications at the level of genes.

Keywords
Algorithms, Floridi, Functional information, Instructional information, Natural information, Self-information, Specified information
National Category
Bioinformatics (Computational Biology)
Identifiers
urn:nbn:se:su:diva-238579 (URN); 10.1007/s11229-024-04613-z (DOI); 2-s2.0-85197946773 (Scopus ID)
Available from: 2025-01-28; Created: 2025-01-28; Last updated: 2025-01-28; Bibliographically approved
Thorvaldsen, S. & Hössjer, O. (2024). Use of directed quasi-metric distances for quantifying the information of gene families. Biosystems (Amsterdam. Print), 243, Article ID 105256.
2024 (English). In: Biosystems (Amsterdam. Print), ISSN 0303-2647, E-ISSN 1872-8324, Vol. 243, article id 105256. Article in journal (Refereed). Published
Abstract [en]

A large hindrance to analyzing information in genetic or protein sequence data has been a lack of a mathematical framework for doing so. In this paper, we present a multinomial probability space X as a general foundation for multicategory discrete data, where categories refer to variants/alleles of biosequences. The external information that is infused in order to generate a sample of such data is quantified as a distance on X between the prior distribution of data and the empirical distribution of the sample. A number of distances on X are treated. All of them have an information theoretic interpretation, reflecting the information that the sampling mechanism provides about which variants that have a selective advantage and therefore appear more frequently compared to prior expectations. This includes distances on X based on mutual information, conditional mutual information, active information, and functional information. The functional information distance is singled out as particularly useful. It is simple and has intuitive interpretations in terms of 1) a rejection sampling mechanism, where functional entities are retained, whereas non-functional categories are censored, and 2) evolutionary waiting times. The functional information is also a quasi-metric on X, with information being measured in an asymmetric, mountainous landscape. This quasi-metric property is also retained for a robustified version of the functional information distance that allows for mutations in the sampling mechanism. The functional information quasi-metric has been applied with success on bioinformatics data sets, for proteins and sequence alignment of protein families.

Keywords
Bioinformatics, Functional information, Multi-category data, Mutual information, Quasi-metrics, Self-information
National Category
Bioinformatics (Computational Biology)
Identifiers
urn:nbn:se:su:diva-237916 (URN); 10.1016/j.biosystems.2024.105256 (DOI); 001281297100001 (); 38871243 (PubMedID); 2-s2.0-85199349935 (Scopus ID)
Available from: 2025-01-15; Created: 2025-01-15; Last updated: 2025-01-15; Bibliographically approved
Allendorf, F. W., Hössjer, O. & Ryman, N. (2024). What does effective population size tell us about loss of allelic variation? Evolutionary Applications, 17(6), Article ID e13733.
2024 (English). In: Evolutionary Applications, E-ISSN 1752-4571, Vol. 17, no 6, article id e13733. Article in journal (Refereed). Published
Abstract [en]

There are two primary measures of the amount of genetic variation in a population at a locus: heterozygosity and the number of alleles. Effective population size (Ne) provides both an expectation of the amount of heterozygosity in a population at drift-mutation equilibrium and the rate of loss of heterozygosity because of genetic drift. In contrast, the number of alleles in a population at drift-mutation equilibrium is a function of both Ne and census size (NC). In addition, populations with the same Ne can lose allelic variation at very different rates. Allelic variation is generally much more sensitive to bottlenecks than heterozygosity. Expressions used to adjust for the effects of violations of the ideal population on Ne do not provide good predictions of the loss of allelic variation. These effects are much greater for loci with many alleles, which are often important for adaptation. We show that there is a linear relationship between the reduction of NC and the corresponding reduction of the expected number of alleles at drift-mutation equilibrium. This makes it possible to predict the expected effect of a bottleneck on allelic variation. Heterozygosity provides good estimates of the rate of adaptive change in the short-term, but allelic variation provides important information about long-term adaptive change. The guideline of long-term Ne being greater than 500 is often used as a primary genetic metric for evaluating conservation status. We recommend that this guideline be expanded to take into account allelic variation as well as heterozygosity.

Keywords
allelic variation, bottleneck, drift-mutation equilibrium, effective population size, genetic drift, heterozygosity
National Category
Evolutionary Biology
Identifiers
urn:nbn:se:su:diva-235532 (URN); 10.1111/eva.13733 (DOI); 2-s2.0-85196665117 (Scopus ID)
Available from: 2024-11-14; Created: 2024-11-14; Last updated: 2024-11-14; Bibliographically approved
Hössjer, O., Laikre, L. & Ryman, N. (2023). Assessment of the Global Variance Effective Size of Subdivided Populations, and Its Relation to Other Effective Sizes. Acta Biotheoretica, 71(3), Article ID 19.
2023 (English). In: Acta Biotheoretica, ISSN 0001-5342, E-ISSN 1572-8358, Vol. 71, no 3, article id 19. Article in journal (Refereed). Published
Abstract [en]

The variance effective population size (NeV) is frequently used to quantify the expected rate at which a population's allele frequencies change over time. The purpose of this paper is to find expressions for the global NeV of a spatially structured population that are of interest for conservation of species. Since NeV depends on allele frequency change, we start by dividing the cause of allele frequency change into genetic drift within subpopulations (I) and a second component mainly due to migration between subpopulations (II). We investigate in detail how these two components depend on the way in which subpopulations are weighted as well as their dependence on parameters of the model such as migration rates, and local effective and census sizes. It is shown that under certain conditions the impact of II is eliminated, and NeV of the metapopulation is maximized, when subpopulations are weighted proportionally to their long term reproductive contributions. This maximal NeV is the sought for global effective size, since it approximates the gene diversity effective size NeGD, a quantifier of the rate of loss of genetic diversity that is relevant for conservation of species and populations. We also propose two novel versions of NeV, one of which (the backward version of NeV) is most stable, exists for most populations, and is closer to NeGD than the classical notion of NeV. Expressions for the optimal length of the time interval for measuring genetic change are developed, that make it possible to estimate any version of NeV with maximal accuracy.

Keywords
Genetic diversity, Length of time interval, Matrix analytic recursions, Metapopulation, Migration-drift equilibrium, Perturbation theory of matrices, Variance effective size
National Category
Evolutionary Biology
Identifiers
urn:nbn:se:su:diva-221119 (URN); 10.1007/s10441-023-09470-w (DOI); 001032489500001 (); 37458852 (PubMedID); 2-s2.0-85158004417 (Scopus ID)
Available from: 2023-09-19; Created: 2023-09-19; Last updated: 2023-09-19; Bibliographically approved
Zhou, L., Díaz-Pachón, D. A., Zhao, C., Rao, J. S. & Hössjer, O. (2023). Correcting prevalence estimation for biased sampling with testing errors. Statistics in Medicine, 42(26), 4713-4737
2023 (English). In: Statistics in Medicine, ISSN 0277-6715, E-ISSN 1097-0258, Vol. 42, no 26, p. 4713-4737. Article in journal (Refereed). Published
Abstract [en]

Sampling for prevalence estimation of infection is subject to bias by both oversampling of symptomatic individuals and error-prone tests. This results in naïve estimators of prevalence (ie, proportion of observed infected individuals in the sample) that can be very far from the true proportion of infected. In this work, we present a method of prevalence estimation that reduces both the effect of bias due to testing errors and oversampling of symptomatic individuals, eliminating it altogether in some scenarios. Moreover, this procedure considers stratified errors in which tests have different error rate profiles for symptomatic and asymptomatic individuals. This results in easily implementable algorithms, for which code is provided, that produce better prevalence estimates than other methods (in terms of reducing and/or removing bias), as demonstrated by formal results, simulations, and on COVID-19 data from the Israeli Ministry of Health.

Keywords
active information, bias correction, COVID-19, maximum entropy, prevalence, sampling, sampling bias, testing errors
National Category
Probability Theory and Statistics; Public Health, Global Health and Social Medicine
Identifiers
urn:nbn:se:su:diva-225644 (URN); 10.1002/sim.9885 (DOI); 001122028600001 (); 37655557 (PubMedID); 2-s2.0-85169446081 (Scopus ID)
Available from: 2024-01-31; Created: 2024-01-31; Last updated: 2025-02-20; Bibliographically approved
Kurland, S., Ryman, N., Hössjer, O. & Laikre, L. (2023). Effects of subpopulation extinction on effective size (Ne) of metapopulations. Conservation Genetics, 24(4), 417-433
2023 (English). In: Conservation Genetics, ISSN 1566-0621, E-ISSN 1572-9737, Vol. 24, no 4, p. 417-433. Article in journal (Refereed). Published
Abstract [en]

Population extinction is ubiquitous in all taxa. Such extirpations can reduce intraspecific diversity, but the extent to which genetic diversity of surviving populations are affected remains largely unclear. A key concept in this context is the effective population size (Ne), which quantifies the rate at which genetic diversity within populations is lost. Ne was developed for single, isolated populations while many natural populations are instead connected to other populations via gene flow. Recent analytical approaches and software permit modelling of Ne of interconnected populations (metapopulations). Here, we apply such tools to investigate how extinction of subpopulations affects Ne of the metapopulation (NeMeta) and of separate surviving subpopulations (NeRx) under different rates and patterns of genetic exchange between subpopulations. We assess extinction effects before and at migration-drift equilibrium. We find that the effect of extinction on NeMeta increases with reduced connectivity, suggesting that stepping stone models of migration are more impacted than island-migration models when the same number of subpopulations are lost. Furthermore, in stepping stone models, after extinction and before a new equilibrium has been reached, NeRx can vary drastically among surviving subpopulations and depends on their initial spatial position relative to extinct ones. Our results demonstrate that extinctions can have far more complex effects on the retention of intraspecific diversity than typically recognized. Metapopulation dynamics need heightened consideration in sustainable management and conservation, e.g., in monitoring genetic diversity, and are relevant to a wide range of species in the ongoing extinction crisis. 

Keywords
Inbreeding effective population size, Eigenvalue effective size, Realized effective size, Substructured populations, Conservation genetics
National Category
Genetics and Genomics; Ecology
Identifiers
urn:nbn:se:su:diva-216315 (URN); 10.1007/s10592-023-01510-9 (DOI); 000953077900002 (); 2-s2.0-85150289396 (Scopus ID)
Available from: 2023-04-12; Created: 2023-04-12; Last updated: 2025-02-01; Bibliographically approved
Thorvaldsen, S. & Hössjer, O. (2023). Estimating the information content of genetic sequence data. The Journal of the Royal Statistical Society, Series C: Applied Statistics, 72(5), 1310-1338
2023 (English). In: The Journal of the Royal Statistical Society, Series C: Applied Statistics, ISSN 0035-9254, E-ISSN 1467-9876, Vol. 72, no 5, p. 1310-1338. Article in journal (Refereed). Published
Abstract [en]

A prominent problem in analysing genetic information has been a lack of mathematical frameworks for doing so. This article offers some new statistical methods to model and analyse information content in proteins, protein families, and their sequences. We discuss how to understand the qualitative aspects of genetic information, how to estimate the quantitative aspects of it, and implement a statistical model where the qualitative genetic function is represented jointly with its probabilistic metric of self-information. The functional information of protein families in the Cath and Pfam databases is estimated using a method inspired by rejection sampling. Scientific work may place these components of information as one of the fundamental aspects of molecular biology.

Keywords
functional information, mutual information, rejection sampling, self-information
National Category
Probability Theory and Statistics
Identifiers
urn:nbn:se:su:diva-221410 (URN); 10.1093/jrsssc/qlad062 (DOI); 001032671000001 (); 2-s2.0-85183119634 (Scopus ID)
Available from: 2023-09-20; Created: 2023-09-20; Last updated: 2024-03-04; Bibliographically approved
Identifiers
ORCID iD: orcid.org/0000-0003-2767-8818
