Manually annotate masses with database
Jan Stanstrup
2023-11-09
from_masses.RmdThis tutorial contains references to specific chromatograhic systems in use at NEXS, University of Copenhagen and might be confusing to others.
If you want a copy of this Rmarkdown tutorial you can do the following, which will create a copy in your working folder:
file.copy(system.file("doc/from_masses.Rmd", package="XCMSAnnotator"), ".")If you would rather have a script you can do:
knitr::purl(system.file("doc/from_masses.Rmd", package="XCMSAnnotator"))Load some required packages
library(tidyverse)
library(XCMSAnnotator)
library(MetaboAnnotation)
library(ProtGenerics)
library(readxl)
library(Spectra)
library(conflicted)
conflicts_prefer(dplyr::filter)
conflicts_prefer(MetaboAnnotation::matches)If you are missing packages you can run
devtools::install_gitlab("stanstrup_r_packages/xcms-annotator").
Creating Spectra objects
Manually creating a set of spectra with retention time and intensities
To understand how we can manually create a spectrum object we will do it first by hand.
First we make a table that contains the msLevel (MS1 or MS2) and retention time for two spectra:
We can then add the mz and intensities for these two peaks:
spd$mz <- list(c(100, 103.2, 104.3, 106.5),
c(45.6, 120.4, 190.2)
)
spd$intensity <- list(c(200, 400, 34.2, 17),
c(12.3, 15.2, 6.8)
)We can now make a Spectra object:
spectra_with_rt_and_int <- Spectra(spd)
spectra_with_rt_and_int## MSn data (Spectra) with 2 spectra in a MsBackendMemory backend:
## msLevel rtime scanIndex
## <integer> <numeric> <integer>
## 1 1 1.1 NA
## 2 2 1.2 NA
## ... 16 more variables/columns.
We can also re-create a table with the spectra:
spectra_with_rt_and_int %>%
peaksData %>%
as_tibble %>%
group_nest(group) %>%
bind_cols(as_tibble(spectraData(spectra_with_rt_and_int)[, c("msLevel", "rtime")]),.) %>%
unnest(data)## # A tibble: 7 × 6
## msLevel rtime group group_name mz intensity
## <int> <dbl> <int> <chr> <dbl> <dbl>
## 1 1 1.1 1 NA 100 200
## 2 1 1.1 1 NA 103. 400
## 3 1 1.1 1 NA 104. 34.2
## 4 1 1.1 1 NA 106. 17
## 5 2 1.2 2 NA 45.6 12.3
## 6 2 1.2 2 NA 120. 15.2
## 7 2 1.2 2 NA 190. 6.8
By reading the data from an excel file instead
A more comfortable way to read some manual spectra might be to read them from an Excel table like this:
spectra_from_excel <- read_xlsx(system.file("extdata/manual_spectra.xlsx", package="XCMSAnnotator"))
spectra_from_excel## # A tibble: 23 × 6
## ionmode group msLevel rtime mz intensity
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 pos 111 1 4.77 478. 1.12
## 2 pos 111 1 4.77 496. 100
## 3 pos 111 1 4.77 497. 19.6
## 4 pos 111 1 4.77 498. 2.24
## 5 pos 111 1 4.77 518. 5.01
## 6 pos 111 1 4.77 524. 1.69
## 7 pos 111 1 4.77 525. 1.04
## 8 pos 111 1 4.77 992. 14.1
## 9 pos 111 1 4.77 993. 6.02
## 10 pos 980 1 1.83 93.1 1.58
## # ℹ 13 more rows
Here we use the group column to uniquely identify each compound. So you can make your data look the same.
Now we then convert that to a spectra object:
spectra_from_excel_as_spectra <- spectra_from_excel %>%
chop(c(mz, intensity)) %>%
mutate(msLevel = as.integer(msLevel)) %>%
mutate(rtime = rtime*60) %>% # because we set rt in minutes but the DB is in seconds
Spectra
spectra_from_excel_as_spectra## MSn data (Spectra) with 3 spectra in a MsBackendMemory backend:
## msLevel rtime scanIndex
## <integer> <numeric> <integer>
## 1 1 286.377 NA
## 2 1 109.548 NA
## 3 2 109.797 NA
## ... 18 more variables/columns.
Matching to a database
Now we read your database of compounds that was created by MScurate and filter it for a specific machine. Here we are matching to the data from the Synapt.
DB <- readRDS(system.file("extdata/DB.rds", package="XCMSAnnotator"))
DB_selected <- DB %>%
spectraData %>%
as_tibble %>%
pull(dataOrigin) %>%
grep("Synapt", .) %>% # We want only the standards that have "Synapt" in the path.
{DB[.]}
DB_selected## MSn data (Spectra) with 1435 spectra in a MsBackendDataFrame backend:
## msLevel rtime scanIndex
## <integer> <numeric> <integer>
## 1 1 286.377 1135
## 2 2 286.129 1134
## 3 1 299.703 1188
## 4 1 289.383 1147
## 5 1 295.728 1172
## ... ... ... ...
## 1431 1 195.548 774
## 1432 2 195.796 775
## 1433 2 31.742 123
## 1434 1 29.513 116
## 1435 2 29.761 117
## ... 48 more variables/columns.
## Processing:
## Merge 2304 Spectra into one [Mon Sep 27 13:09:49 2021]
We can now match the DB and the peaklist. First we setup some matching criteria.
mp <- MatchForwardReverseParam(ppm = 30,
requirePrecursor = FALSE,
THRESHFUN = function(x) which(x > 0),
THRESHFUN_REVERSE = function(x) which(x >= 0.7), # so only keep matches if the reverse score is better than 0.7
toleranceRt = 0.2*60 # 0.2 min tolerance.
)We do the matching:
mtches <- matchSpectra(spectra_from_excel_as_spectra, DB_selected, param = mp)To avoid splitting both the database and the peaklist we have matching across polarities, which obvious doesn’t make sense. So here we remove matching that are not between spectra of the same polarity.
mtches_mode <- filterMisMatchedPolarity(mtches, query_mode_col = "ionmode", target_mode_col = "polarity")
mtches_mode## Object of class MatchedSpectra
## Total number of matches: 8
## Number of query objects: 3 (3 matched)
## Number of target objects: 1435 (5 matched)
As you can see in the summary all our 3 spectra have been found.
Now lets see what we found:
DB_selected %>%
spectraData %>%
as_tibble %>%
slice(matches(mtches_mode)$target_idx) %>%
mutate(query_idx = matches(mtches_mode)$query_idx, target_idx = matches(mtches_mode)$target_idx) %>%
select(query_idx, target_idx, name, msLevel, polarity, rtime, dataOrigin)## # A tibble: 8 × 7
## query_idx target_idx name msLevel polarity rtime dataOrigin
## <int> <int> <chr> <int> <int> <dbl> <chr>
## 1 1 1 LPC(16:0) 1 1 286. /media/hostsource…
## 2 1 2 LPC(16:0) 2 1 286. /media/hostsource…
## 3 2 419 L-Phenylalanine 1 1 110. /media/hostsource…
## 4 2 420 L-Phenylalanine 2 1 110. /media/hostsource…
## 5 2 632 L-Phenylalanine 1 1 110. /media/hostsource…
## 6 3 419 L-Phenylalanine 1 1 110. /media/hostsource…
## 7 3 420 L-Phenylalanine 2 1 110. /media/hostsource…
## 8 3 632 L-Phenylalanine 1 1 110. /media/hostsource…
You can also get more details about the match:
matches(mtches_mode)## query_idx target_idx score reverse_score presence_ratio
## 1 1 1 0.9435464 0.9435464 0.1607143
## 2 1 2 0.8070411 0.8070411 0.2195122
## 12 2 419 0.9457657 0.9457657 0.2608696
## 13 2 420 0.8498953 0.8870301 0.3333333
## 14 2 632 0.9524082 0.9642384 0.4545455
## 16 3 419 0.8699097 0.8699097 0.3478261
## 17 3 420 0.9692363 0.9692363 0.5333333
## 18 3 632 0.8763451 0.8835292 0.6363636
## matched_peaks_count
## 1 9
## 2 9
## 12 6
## 13 5
## 14 5
## 16 8
## 17 8
## 18 7
Matching just a mass to all spectra
Can we do it also if we just had a single mz we wanted to look-up? Yes!
Lets imagine we simply want to match this list:
spectra_from_excel_mz_only <- spectra_from_excel %>%
slice(c(2, 15, 21)) %>%
select(ionmode, group, mz)
spectra_from_excel_mz_only## # A tibble: 3 × 3
## ionmode group mz
## <chr> <dbl> <dbl>
## 1 pos 111 496.
## 2 pos 980 166.
## 3 pos 981 166.
We again can “fake” a Spectra object.
spectra_from_excel_mz_only_as_spectra <- spectra_from_excel_mz_only %>%
mutate(intensity = 100) %>% # it cannot be empty to we just put something
chop(c(mz, intensity)) %>%
Spectra
spectra_from_excel_mz_only_as_spectra## MSn data (Spectra) with 3 spectra in a MsBackendMemory backend:
## msLevel rtime scanIndex
## <integer> <numeric> <integer>
## 1 NA NA NA
## 2 NA NA NA
## 3 NA NA NA
## ... 18 more variables/columns.
We make the parameters to match by. This time we require only the mass to match no matter the score or RT.
mp <- MatchForwardReverseParam(ppm = 30,
requirePrecursor = FALSE,
THRESHFUN = function(x) which(x > 0),
THRESHFUN_REVERSE = function(x) which(x >= 0), # so all matches are kept
toleranceRt = Inf
)We do the matching:
mtches <- matchSpectra(spectra_from_excel_mz_only_as_spectra, DB_selected, param = mp)To avoid splitting both the database and the peaklist we have matching across polarities, which obvious doesn’t make sense. So here we remove matching that are not between spectra of the same polarity.
mtches_mode <- filterMisMatchedPolarity(mtches, query_mode_col = "ionmode", target_mode_col = "polarity")
mtches_mode## Object of class MatchedSpectra
## Total number of matches: 44
## Number of query objects: 3 (3 matched)
## Number of target objects: 1435 (24 matched)
This time we get 44 matches. So obviously some are wrong…
Now lets see what we found:
DB_selected %>%
spectraData %>%
as_tibble %>%
slice(matches(mtches_mode)$target_idx) %>%
mutate(query_idx = matches(mtches_mode)$query_idx, target_idx = matches(mtches_mode)$target_idx) %>%
select(query_idx, target_idx, name, msLevel, polarity, rtime, dataOrigin)## # A tibble: 44 × 7
## query_idx target_idx name msLevel polarity rtime dataOrigin
## <int> <int> <chr> <int> <int> <dbl> <chr>
## 1 1 1 LPC(16:0) 1 1 286. /media/hostsour…
## 2 1 2 LPC(16:0) 2 1 286. /media/hostsour…
## 3 1 9 LPC(17:1) 1 1 284. /media/hostsour…
## 4 1 744 LPC (18:1) 1 1 288. /media/hostsour…
## 5 2 161 stachydrine 1 1 36.5 /media/hostsour…
## 6 2 295 α-Methyl-L-DOPA 1 1 62.5 /media/hostsour…
## 7 2 319 Asp-Phe 1 1 163. /media/hostsour…
## 8 2 320 Asp-Phe 2 1 163. /media/hostsour…
## 9 2 414 Pantothenic acid 1 1 189. /media/hostsour…
## 10 2 415 Pantothenic acid 1 1 131. /media/hostsour…
## # ℹ 34 more rows
You can also get more details about the match:
matches(mtches_mode)## query_idx target_idx score reverse_score presence_ratio
## 1 1 1 0.6252763335 0.6252763335 0.01785714
## 2 1 2 0.4183212537 0.4183212537 0.02439024
## 3 1 9 0.0003687644 0.0003687644 0.00800000
## 4 1 744 0.0001975497 0.0001975497 0.00800000
## 7 2 161 0.0083635038 0.0083635038 0.09090909
## 8 2 295 0.3898961889 0.3898961889 0.04347826
## 9 2 319 0.2309524283 0.2309524283 0.02222222
## 10 2 320 0.0338301984 0.0338301984 0.02857143
## 11 2 414 0.0108036658 0.0108036658 0.04545455
## 12 2 415 0.0063866429 0.0063866429 0.02439024
## 13 2 419 0.0233418240 0.0233418240 0.04347826
## 14 2 420 0.0284218862 0.0284218862 0.06666667
## 15 2 570 0.3981241686 0.3981241686 0.05000000
## 16 2 583 0.2381783831 0.2381783831 0.02173913
## 17 2 632 0.0250502963 0.0250502963 0.09090909
## 18 2 877 0.1973823876 0.1973823876 0.05000000
## 19 2 878 0.1398388075 0.1398388075 0.08333333
## 20 2 955 0.0042252548 0.0042252548 0.01265823
## 21 2 1003 0.1982663747 0.1982663747 0.03846154
## 22 2 1004 0.0987288210 0.0987288210 0.09090909
## 23 2 1333 0.0695714147 0.0695714147 0.04761905
## 24 2 1334 0.0360922985 0.0360922985 0.10000000
## 25 2 1358 0.0481780181 0.0481780181 0.01754386
## 26 2 1359 0.0488774586 0.0488774586 0.02857143
## 31 3 161 0.0083635038 0.0083635038 0.09090909
## 32 3 295 0.3898961889 0.3898961889 0.04347826
## 33 3 319 0.2309524283 0.2309524283 0.02222222
## 34 3 320 0.0338301984 0.0338301984 0.02857143
## 35 3 414 0.0108036658 0.0108036658 0.04545455
## 36 3 415 0.0063866429 0.0063866429 0.02439024
## 37 3 419 0.0233418240 0.0233418240 0.04347826
## 38 3 420 0.0284218862 0.0284218862 0.06666667
## 39 3 570 0.3981241686 0.3981241686 0.05000000
## 40 3 583 0.2381783831 0.2381783831 0.02173913
## 41 3 632 0.0250502963 0.0250502963 0.09090909
## 42 3 877 0.1973823876 0.1973823876 0.05000000
## 43 3 878 0.1398388075 0.1398388075 0.08333333
## 44 3 955 0.0042252548 0.0042252548 0.01265823
## 45 3 1003 0.1982663747 0.1982663747 0.03846154
## 46 3 1004 0.0987288210 0.0987288210 0.09090909
## 47 3 1333 0.0695714147 0.0695714147 0.04761905
## 48 3 1334 0.0360922985 0.0360922985 0.10000000
## 49 3 1358 0.0481780181 0.0481780181 0.01754386
## 50 3 1359 0.0488774586 0.0488774586 0.02857143
## matched_peaks_count
## 1 1
## 2 1
## 3 1
## 4 1
## 7 1
## 8 1
## 9 1
## 10 1
## 11 1
## 12 1
## 13 1
## 14 1
## 15 1
## 16 1
## 17 1
## 18 1
## 19 1
## 20 1
## 21 1
## 22 1
## 23 1
## 24 1
## 25 1
## 26 1
## 31 1
## 32 1
## 33 1
## 34 1
## 35 1
## 36 1
## 37 1
## 38 1
## 39 1
## 40 1
## 41 1
## 42 1
## 43 1
## 44 1
## 45 1
## 46 1
## 47 1
## 48 1
## 49 1
## 50 1
OK, this is really only useful if you are really desperate. Way to many spurious matches to be generally useful.
But what if we try to remove really small peaks? We can do that by removing peaks below a certain intensity (here we cut everything below 5% of the highest peak):
above_rel_thres <- function(x, thres = 5) {
100*x/max(x, na.rm = TRUE) >= thres
}
DB_selected_norm_cut <- filterIntensity(DB_selected, above_rel_thres, thres = 5)And then run the matching again.
mp <- MatchForwardReverseParam(ppm = 30,
requirePrecursor = FALSE,
THRESHFUN = function(x) which(x > 0),
THRESHFUN_REVERSE = function(x) which(x >= 0), # so all matches are kept
toleranceRt = Inf
)
mtches <- matchSpectra(spectra_from_excel_mz_only_as_spectra, DB_selected_norm_cut, param = mp)
mtches_mode <- filterMisMatchedPolarity(mtches, query_mode_col = "ionmode", target_mode_col = "polarity")
mtches_mode## Object of class MatchedSpectra
## Total number of matches: 30
## Number of query objects: 3 (3 matched)
## Number of target objects: 1435 (16 matched)
DB_selected_norm_cut %>%
spectraData %>%
as_tibble %>%
slice(matches(mtches_mode)$target_idx) %>%
mutate(query_idx = matches(mtches_mode)$query_idx,
target_idx = matches(mtches_mode)$target_idx
) %>%
select(query_idx, target_idx, name, msLevel, polarity, rtime, dataOrigin)## # A tibble: 30 × 7
## query_idx target_idx name msLevel polarity rtime dataOrigin
## <int> <int> <chr> <int> <int> <dbl> <chr>
## 1 1 1 LPC(16:0) 1 1 286. /media/hostsourc…
## 2 1 2 LPC(16:0) 2 1 286. /media/hostsourc…
## 3 2 295 α-Methyl-L-DOPA 1 1 62.5 /media/hostsourc…
## 4 2 319 Asp-Phe 1 1 163. /media/hostsourc…
## 5 2 320 Asp-Phe 2 1 163. /media/hostsourc…
## 6 2 570 α-Methyl-L-DOPA 1 1 62.5 /media/hostsourc…
## 7 2 583 Asp-Phe 1 1 163. /media/hostsourc…
## 8 2 877 Val-Phe 1 1 192. /media/hostsourc…
## 9 2 878 Val-Phe 2 1 192. /media/hostsourc…
## 10 2 955 Trolox 2 1 244. /media/hostsourc…
## # ℹ 20 more rows
So yes, we got fewer hits… But we lost phenylalanine because the intensity is actually very low!
For something that might be more sensible check the next section.
Matching just a mass to all parent masses
So instead if matching against all peaks in all spectra we might want to just check against parent masses. We can do that my using the metadata of the database and calculating the expected single charged mz.
DB_selected$mz_calc <- DB_selected$computed_exact_mass_largest_neutral +
c(-1.0073, 1.0073)[DB_selected$polarity+1]Then we can simply match by mz of the parent:
param <- MzParam(tolerance = 0, ppm = 30)
matches_mz_only <- matchValues(as.data.frame(spectra_from_excel_mz_only),
DB_selected, param,
mzColname = c("mz", "mz_calc")
)Lets see what we hit this time:
DB_selected %>%
spectraData %>%
as_tibble %>%
slice(matches(matches_mz_only)$target_idx) %>%
mutate(query_idx = matches(matches_mz_only)$query_idx,
target_idx = matches(matches_mz_only)$target_idx
) %>%
select(query_idx, target_idx, name, msLevel, polarity, rtime, dataOrigin)## # A tibble: 8 × 7
## query_idx target_idx name msLevel polarity rtime dataOrigin
## <int> <int> <chr> <int> <int> <dbl> <chr>
## 1 1 1 LPC(16:0) 1 1 286. /media/hostsource…
## 2 1 2 LPC(16:0) 2 1 286. /media/hostsource…
## 3 2 419 L-Phenylalanine 1 1 110. /media/hostsource…
## 4 2 420 L-Phenylalanine 2 1 110. /media/hostsource…
## 5 2 632 L-Phenylalanine 1 1 110. /media/hostsource…
## 6 3 419 L-Phenylalanine 1 1 110. /media/hostsource…
## 7 3 420 L-Phenylalanine 2 1 110. /media/hostsource…
## 8 3 632 L-Phenylalanine 1 1 110. /media/hostsource…
So we got the same hits as originally.