Machine Learning Tools to Identify Risk Drivers in Water
Stockholm University, Faculty of Science, Department of Chemistry. ORCID iD: 0000-0002-8222-9962
2026 (English) Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Due to the increasing number of chemicals used in our daily lives, more and more chemicals end up in the environment. Many such contaminants accumulate in water, with thousands of chemicals detected in environmental water samples using liquid chromatography – high-resolution mass spectrometry (LC/HRMS). As a result, all water-dependent organisms are exposed to a large number of low-concentration chemicals, while the health effects of such exposures are unknown. Unfortunately, only a small fraction of the detected chemicals is identified and can be further investigated for their effects on organisms.

This thesis investigated the opportunity to use experimental data on such detected but unidentified chemicals to predict their environmental concentration levels, their toxicity, and their risk, which combines the two. First, in paper I, the trends in risk estimation for chemicals detected in water samples were investigated across the years 2019 to 2022. The analysis indicated that risk was considered in only 13% of the papers. In paper II, a concentration prediction model, MS2Quant, was developed, allowing concentration prediction for unidentified chemicals based on tandem mass spectra. The experimental data-based concentration predictions were comparable with structure-based predictions. Further, in paper III, the predictions from the MS2Quant model were combined with the in-house developed MS2Tox model for adult fish acute toxicity predictions in order to prioritize features in wastewater samples. While the feature set of the effluent samples was reduced by 73% to 99%, the subsequent structural assignment with library matching and in silico tools could not assign a probable structure for the majority of the prioritized features, highlighting the advantages of incorporating experimental data-based methods in the analysis. Finally, paper IV focused on the experimental validation of mixture toxicity predictions. For this, a complementary fish embryo acute toxicity model was developed, and the toxicity values were experimentally validated for eight chemicals. Combined with concentration predictions, the cumulative mixture toxicity was predicted with a 3× geometric mean error.

The tools developed, investigated, and validated in this thesis showcase the possibility of using available experimental data together with machine learning approaches for exposure and toxicity predictions of unidentified features. They allow looking into a larger subset of detected chemicals for subsequent tandem mass spectra-based prioritization of features that are more likely to cause harm and need immediate attention. 

Place, publisher, year, edition, pages
Stockholm: Department of Chemistry, Stockholm University, 2026, p. 52
Keywords [en]
non-targeted screening, mass spectrometry, liquid chromatography, machine learning, exposure, toxicity, risk, prioritization
National Category
Analytical Chemistry
Research subject
Analytical Chemistry
Identifiers
URN: urn:nbn:se:su:diva-253770
ISBN: 978-91-8107-574-8 (print)
ISBN: 978-91-8107-575-5 (electronic)
OAI: oai:DiVA.org:su-253770
DiVA, id: diva2:2049408
Public defence
2026-05-15, Magnélisalen, Kemiska övningslaboratoriet, Svante Arrhenius väg 16B, Stockholm, 09:00 (English)
Opponent
Supervisors
Available from: 2026-04-22 Created: 2026-03-30 Last updated: 2026-04-14 Bibliographically approved
List of papers
1. Scientometric review: Concentration and toxicity assessment in environmental non-targeted LC/HRMS analysis
2023 (English) In: Trends in Environmental Analytical Chemistry, ISSN 2214-1588, Vol. 40, article id e00217. Article, review/survey (Refereed) Published
Abstract [en]

Non-targeted screening with LC/HRMS is a go-to approach to discover relevant contaminants in environmental water samples that contain an abundance of chemicals. The rapidly increasing popularity of non-targeted LC/HRMS screening has initiated development of a diverse set of methods for assessing the concentration and toxicity of the detected chemicals. This review aims to benchmark the trends in the environmental NTS literature with particular focus on (1) methods used for the quantification of tentatively identified chemicals that lack analytical standards, (2) methods for assessing the toxicity of detected chemicals, and (3) methods combining the former into a risk evaluation. Here we provide a scientometric review of these strategies based on the Web of Science referenced papers published between 2019 and 2022. General trends show that quantification and toxicity assessments are widely employed in NTS, reaching 66 % and 45 % over four years, respectively. Simultaneously, only 13 % of the papers covered here combine these results into a risk factor or similar. With this review we aim to highlight the advantages and gaps in the approaches used for concentration and toxicity assessment and provide guidelines for more homogeneous data interrogation and extrapolation.

Keywords
Liquid chromatography, High-resolution mass spectrometry, Toxicity, Quantification, Non-targeted screening, Suspect screening, Risk assessment, Environmental analysis, Effect directed analysis, Semi-quantification
National Category
Environmental Sciences
Identifiers
urn:nbn:se:su:diva-224223 (URN)
10.1016/j.teac.2023.e00217 (DOI)
001102329000001 ()
2-s2.0-85174170811 (Scopus ID)
Available from: 2023-12-05 Created: 2023-12-05 Last updated: 2026-03-30 Bibliographically approved
2. Bypassing the Identification: MS2Quant for Concentration Estimations of Chemicals Detected with Nontarget LC-HRMS from MS2 Data
2023 (English) In: Analytical Chemistry, ISSN 0003-2700, E-ISSN 1520-6882, Vol. 95, no 33, p. 12329-12338. Article in journal (Refereed) Published
Abstract [en]

Nontarget analysis by liquid chromatography-high-resolution mass spectrometry (LC-HRMS) is now widely used to detect pollutants in the environment. Shifting away from targeted methods has led to detection of previously unseen chemicals, and assessing the risk posed by these newly detected chemicals is an important challenge. Assessing exposure and toxicity of chemicals detected with nontarget HRMS is highly dependent on the knowledge of the structure of the chemical. However, the majority of features detected in nontarget screening remain unidentified and therefore the risk assessment with conventional tools is hampered. Here, we developed MS2Quant, a machine learning model that enables prediction of concentration from fragmentation (MS2) spectra of detected, but unidentified chemicals. MS2Quant is an xgbTree algorithm-based regression model developed using ionization efficiency data for 1191 unique chemicals that spans 8 orders of magnitude. The ionization efficiency values are predicted from structural fingerprints that can be computed from the SMILES notation of the identified chemicals or from MS2 spectra of unidentified chemicals using SIRIUS+CSI:FingerID software. The root mean square errors of the training and test sets were 0.55 (3.5x) and 0.80 (6.3x) log-units, respectively. In comparison, ionization efficiency prediction approaches that depend on assigning an unequivocal structure typically yield errors from 2x to 6x. The MS2Quant quantification model was validated on a set of 39 environmental pollutants and resulted in a mean prediction error of 7.4x, a geometric mean of 4.5x, and a median of 4.0x. For comparison, a model based on PaDEL descriptors that depends on unequivocal structural assignment was developed using the same dataset. The latter approach yielded a comparable mean prediction error of 9.5x, a geometric mean of 5.6x, and a median of 5.2x on the validation set chemicals when the top structural assignment was used as input. This confirms that MS2Quant makes it possible to extract exposure information for unidentified chemicals which, although detected, have thus far been disregarded due to lack of accurate tools for quantification. The MS2Quant model is available as an R package on GitHub for improving discovery and monitoring of potentially hazardous environmental pollutants with nontarget screening.
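
The fold errors quoted in parentheses follow from the log-unit errors: an error of e log10 units corresponds to a factor of 10^e. The Python sketch below illustrates that conversion and the three summary statistics named in the abstract. The function names are illustrative assumptions and are not part of the MS2Quant R package.

```python
import statistics


def log_error_to_fold(log_error: float) -> float:
    """Convert an error in log10 units into a multiplicative (fold) error."""
    return 10 ** log_error


def fold_error(predicted: float, measured: float) -> float:
    """Fold error >= 1, treating over- and under-prediction symmetrically."""
    ratio = predicted / measured
    return max(ratio, 1 / ratio)


def summarize(fold_errors):
    """Mean, geometric mean, and median of a collection of fold errors."""
    return {
        "mean": statistics.mean(fold_errors),
        "geometric_mean": statistics.geometric_mean(fold_errors),
        "median": statistics.median(fold_errors),
    }


# RMSE values reported in the abstract, in log10 units
print(round(log_error_to_fold(0.55), 1))  # 3.5
print(round(log_error_to_fold(0.80), 1))  # 6.3
```

The arithmetic mean is pulled upward by a few large errors, which is one plausible reason the reported mean (7.4x) exceeds the geometric mean (4.5x) and the median (4.0x).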

National Category
Analytical Chemistry
Identifiers
urn:nbn:se:su:diva-220853 (URN)
10.1021/acs.analchem.3c01744 (DOI)
001042711000001 ()
37548594 (PubMedID)
2-s2.0-85168386106 (Scopus ID)
Available from: 2023-09-12 Created: 2023-09-12 Last updated: 2026-03-30 Bibliographically approved
3. High-risk contaminants detected in wastewater effluent samples can be prioritized prior to structural assignment using machine learning tools
(English) Manuscript (preprint) (Other academic)
Abstract [en]

Discovering hazardous pollutants currently relies on the tedious and often inaccurate structural identification step, required for further toxicity and exposure studies. Here, we propose and validate a workflow for prioritizing the detected features solely from their mass spectrometric data, based on a priority score reflecting the risk posed by these chemicals. The workflow integrates two machine learning approaches, MS2Tox and MS2Quant, that predict toxicity and concentration of unidentified molecular features, respectively, and does not require any additional standards to be measured in the same run, allowing the application to digitally frozen data. Validation using the priority score to classify 23 chemicals with available risk quotient values into high and low risk categories yielded a recall of 0.55 to 0.74, a precision of 0.14 to 0.40, and an accuracy of 0.46 to 0.69, depending on the acquisition mode and fish species. Applying the developed workflow to wastewater effluent prioritized 20-27% of features with predicted fingerprints as “precautionary risk features” with a risk quotient ≥1 based on the lower limit of the 95% prediction interval. All prioritized features were subject to spectral library matching, with 11% of the features yielding level 2 identification. Features categorized as high-risk were further subject to structural annotation using in silico identification tools SIRIUS+CSI:FingerID and MetFrag. While plausible candidates were suggested, the in silico tools rarely agreed within the top 10 suggested structures, highlighting structural assignment as a bottleneck that can be bypassed by data-driven feature prioritization.
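
The abstract does not spell out how the priority score is computed. The sketch below therefore assumes the common definition of a risk quotient as a predicted concentration divided by a predicted toxicity endpoint (for example an LC50) in the same units, and flags a feature as high risk when the quotient reaches 1. It then shows how recall, precision, and accuracy are derived from such flags. All class names, function names, and numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Feature:
    name: str
    predicted_conc: float  # predicted concentration (hypothetical units)
    predicted_lc50: float  # predicted lethal concentration, same units


def risk_quotient(feature: Feature) -> float:
    """Assumed definition: exposure divided by the toxicity endpoint."""
    return feature.predicted_conc / feature.predicted_lc50


def flag_high_risk(features, threshold=1.0):
    """True for every feature whose risk quotient reaches the threshold."""
    return [risk_quotient(f) >= threshold for f in features]


def classification_metrics(predicted, actual):
    """Recall, precision, and accuracy from paired boolean labels."""
    pairs = list(zip(predicted, actual))
    tp = sum(p and a for p, a in pairs)
    fp = sum(p and not a for p, a in pairs)
    fn = sum(not p and a for p, a in pairs)
    tn = sum(not p and not a for p, a in pairs)
    return {
        "recall": tp / (tp + fn) if tp + fn else float("nan"),
        "precision": tp / (tp + fp) if tp + fp else float("nan"),
        "accuracy": (tp + tn) / len(pairs) if pairs else float("nan"),
    }


# Made-up features and reference labels, for illustration only
features = [
    Feature("A", 2.0, 1.0),
    Feature("B", 0.1, 1.0),
    Feature("C", 5.0, 2.0),
    Feature("D", 0.5, 10.0),
]
reference_high_risk = [True, False, False, True]
print(classification_metrics(flag_high_risk(features), reference_high_risk))
```

A precautionary variant in the spirit of the abstract would compute the quotient from prediction-interval bounds rather than point estimates; the source does not specify which bounds enter the ratio, so that step is left out here.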

Keywords
quantification, non-target screening, non-targeted screening, transformation products, persistent chemicals
National Category
Analytical Chemistry
Research subject
Analytical Chemistry
Identifiers
urn:nbn:se:su:diva-253767 (URN)
Available from: 2026-03-27 Created: 2026-03-27 Last updated: 2026-04-01
4. Toxic Unit Prediction of Mixtures Using High-Resolution Mass Spectrometric Data Combined with in vivo Exposure of Zebrafish Embryos
(English) Manuscript (preprint) (Other academic)
Abstract [en]

Estimating the risk posed by a mixture of environmental contaminants is a complex task. Here, we aimed to predict toxic units for chemicals in mixtures to evaluate their contribution to the overall mixture toxicity. In order to validate the workflow, 36 designed mixtures representing different cumulative mixture toxicity scenarios were analyzed using liquid chromatography coupled to high-resolution mass spectrometry. Two fragmentation spectra approaches were investigated: mixture-specific fragmentation spectra (MS2) and MS2 spectra merged over a range of collision energies. To enable experimental validation, a fish embryo acute toxicity prediction model was developed, and the toxicity was experimentally determined for eight chemicals. Chemical-specific toxic units were predicted with a geometric mean fold error of 6-7× in positive and 6-21× in negative electrospray ionization mode, while the cumulative mixture toxicity was predicted with a geometric mean fold error of 2-3× for both structure and MS2-based methods. The merged MS2 spectra improved predictions compared to mixture-specific MS2 spectra and performed comparably with the structure-based predictions. These results demonstrate that both structure and MS2-based mixture toxicity predictions can be equally useful for further application on real-life samples. 
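
A toxic unit is conventionally the ratio of a chemical's concentration to its effect concentration (for example an LC50), and under the concentration addition assumption the toxicity of a mixture is the sum of the individual toxic units. The abstract does not state which mixture model was used, so the following Python sketch rests on that assumption and uses made-up values and hypothetical function names.

```python
def toxic_unit(concentration: float, lc50: float) -> float:
    """Toxic unit of one chemical: concentration relative to its LC50."""
    return concentration / lc50


def mixture_toxic_units(components) -> float:
    """Cumulative toxic units of a mixture, assuming concentration addition.

    `components` is an iterable of (concentration, lc50) pairs in matching units.
    """
    return sum(toxic_unit(conc, lc50) for conc, lc50 in components)


def fold_error(predicted: float, reference: float) -> float:
    """Symmetric fold error (>= 1) between a prediction and a reference value."""
    ratio = predicted / reference
    return max(ratio, 1 / ratio)


# Made-up three-component mixture: (concentration, LC50) in the same units
predicted_mixture = [(1.0, 10.0), (2.0, 4.0), (0.5, 5.0)]
reference_mixture = [(1.0, 5.0), (2.0, 8.0), (0.5, 5.0)]

tu_predicted = mixture_toxic_units(predicted_mixture)
tu_reference = mixture_toxic_units(reference_mixture)
print(tu_predicted, tu_reference, fold_error(tu_predicted, tu_reference))
```

Because over- and under-predictions of individual chemicals can partly cancel in the sum, the cumulative value can carry a smaller fold error than the chemical-specific toxic units, which is consistent with the 2-3× versus 6-21× figures reported above.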

National Category
Analytical Chemistry
Research subject
Analytical Chemistry
Identifiers
urn:nbn:se:su:diva-253769 (URN)
Available from: 2026-03-27 Created: 2026-03-27 Last updated: 2026-04-01

Open Access in DiVA

Machine Learning Tools to Identify Risk Drivers in Water (2300 kB), 11 downloads
File information
File name: FULLTEXT01.pdf
File size: 2300 kB
Checksum SHA-512
56615e8f7c3f474d884ecf4f8109dc6952ba4e5ec5d9f1ba3aaede201f9808a80f87fa19698f787894a6851fa26126c0c82164ce5555227a3d32fa6959301372
Type: fulltext
Mimetype: application/pdf

Search in DiVA

By author/editor
Sepman, Helen
By organisation
Department of Chemistry
Analytical Chemistry
