Manually annotate masses with database

This tutorial contains references to specific chromatograhic systems in use at NEXS, University of Copenhagen and might be confusing to others.

If you want a copy of this Rmarkdown tutorial you can do the following, which will create a copy in your working folder:

file.copy(system.file("doc/from_masses.Rmd", package="XCMSAnnotator"), ".")

If you would rather have a script you can do:

knitr::purl(system.file("doc/from_masses.Rmd", package="XCMSAnnotator"))

Load some required packages

library(tidyverse)
library(XCMSAnnotator)
library(MetaboAnnotation)
library(ProtGenerics)
library(readxl)
library(Spectra)
library(conflicted)
conflicts_prefer(dplyr::filter)
conflicts_prefer(MetaboAnnotation::matches)

If you are missing packages you can run devtools::install_gitlab("stanstrup_r_packages/xcms-annotator").

Creating Spectra objects

Manually creating a set of spectra with retention time and intensities

To understand how we can manually create a spectrum object we will do it first by hand.

First we make a table that contains the msLevel (MS1 or MS2) and retention time for two spectra:

spd <- tibble(msLevel = c(1L, 2L),
              rtime = c(1.1, 1.2)
              )

We can then add the mz and intensities for these two peaks:

spd$mz <- list(c(100, 103.2, 104.3, 106.5), 
               c(45.6, 120.4, 190.2)
               )

spd$intensity <- list(c(200, 400, 34.2, 17), 
                      c(12.3, 15.2, 6.8)
                      )

We can now make a Spectra object:

spectra_with_rt_and_int <- Spectra(spd)

spectra_with_rt_and_int

## MSn data (Spectra) with 2 spectra in a MsBackendMemory backend:
##     msLevel     rtime scanIndex
##   <integer> <numeric> <integer>
## 1         1       1.1        NA
## 2         2       1.2        NA
##  ... 16 more variables/columns.

We can also re-create a table with the spectra:

spectra_with_rt_and_int %>% 
                peaksData %>% 
                as_tibble %>% 
                group_nest(group) %>% 
                bind_cols(as_tibble(spectraData(spectra_with_rt_and_int)[, c("msLevel", "rtime")]),.) %>% 
                unnest(data)

## # A tibble: 7 × 6
##   msLevel rtime group group_name    mz intensity
##     <int> <dbl> <int> <chr>      <dbl>     <dbl>
## 1       1   1.1     1 NA         100       200  
## 2       1   1.1     1 NA         103.      400  
## 3       1   1.1     1 NA         104.       34.2
## 4       1   1.1     1 NA         106.       17  
## 5       2   1.2     2 NA          45.6      12.3
## 6       2   1.2     2 NA         120.       15.2
## 7       2   1.2     2 NA         190.        6.8

By reading the data from an excel file instead

A more comfortable way to read some manual spectra might be to read them from an Excel table like this:

spectra_from_excel <- read_xlsx(system.file("extdata/manual_spectra.xlsx", package="XCMSAnnotator"))

spectra_from_excel

## # A tibble: 23 × 6
##    ionmode group msLevel rtime    mz intensity
##    <chr>   <dbl>   <dbl> <dbl> <dbl>     <dbl>
##  1 pos       111       1  4.77 478.       1.12
##  2 pos       111       1  4.77 496.     100   
##  3 pos       111       1  4.77 497.      19.6 
##  4 pos       111       1  4.77 498.       2.24
##  5 pos       111       1  4.77 518.       5.01
##  6 pos       111       1  4.77 524.       1.69
##  7 pos       111       1  4.77 525.       1.04
##  8 pos       111       1  4.77 992.      14.1 
##  9 pos       111       1  4.77 993.       6.02
## 10 pos       980       1  1.83  93.1      1.58
## # ℹ 13 more rows

Here we use the group column to uniquely identify each compound. So you can make your data look the same.

Now we then convert that to a spectra object:

spectra_from_excel_as_spectra <- spectra_from_excel %>% 
                                  chop(c(mz, intensity)) %>% 
                                  mutate(msLevel = as.integer(msLevel)) %>%
                                  mutate(rtime = rtime*60) %>% # because we set rt in minutes but the DB is in seconds
                                  Spectra

spectra_from_excel_as_spectra

## MSn data (Spectra) with 3 spectra in a MsBackendMemory backend:
##     msLevel     rtime scanIndex
##   <integer> <numeric> <integer>
## 1         1   286.377        NA
## 2         1   109.548        NA
## 3         2   109.797        NA
##  ... 18 more variables/columns.

Matching to a database

Now we read your database of compounds that was created by MScurate and filter it for a specific machine. Here we are matching to the data from the Synapt.

DB <- readRDS(system.file("extdata/DB.rds", package="XCMSAnnotator"))

DB_selected <- DB %>% 
                spectraData %>% 
                as_tibble %>% 
                pull(dataOrigin) %>% 
                grep("Synapt", .) %>% # We want only the standards that have "Synapt" in the path.
                {DB[.]}


DB_selected

## MSn data (Spectra) with 1435 spectra in a MsBackendDataFrame backend:
##        msLevel     rtime scanIndex
##      <integer> <numeric> <integer>
## 1            1   286.377      1135
## 2            2   286.129      1134
## 3            1   299.703      1188
## 4            1   289.383      1147
## 5            1   295.728      1172
## ...        ...       ...       ...
## 1431         1   195.548       774
## 1432         2   195.796       775
## 1433         2    31.742       123
## 1434         1    29.513       116
## 1435         2    29.761       117
##  ... 48 more variables/columns.
## Processing:
##  Merge 2304 Spectra into one [Mon Sep 27 13:09:49 2021]

We can now match the DB and the peaklist. First we setup some matching criteria.

mp <- MatchForwardReverseParam(ppm = 30, 
                               requirePrecursor = FALSE, 
                               THRESHFUN = function(x) which(x > 0), 
                               THRESHFUN_REVERSE = function(x) which(x >= 0.7), # so only keep matches if the reverse score is better than 0.7
                               toleranceRt = 0.2*60 # 0.2 min tolerance.
                               )

We do the matching:

mtches <- matchSpectra(spectra_from_excel_as_spectra, DB_selected, param = mp)

To avoid splitting both the database and the peaklist we have matching across polarities, which obvious doesn’t make sense. So here we remove matching that are not between spectra of the same polarity.

mtches_mode <- filterMisMatchedPolarity(mtches, query_mode_col = "ionmode", target_mode_col = "polarity")

mtches_mode

## Object of class MatchedSpectra 
## Total number of matches: 8 
## Number of query objects: 3 (3 matched)
## Number of target objects: 1435 (5 matched)

As you can see in the summary all our 3 spectra have been found.

Now lets see what we found:

DB_selected %>% 
  spectraData %>%
  as_tibble %>% 
  slice(matches(mtches_mode)$target_idx) %>%
  mutate(query_idx = matches(mtches_mode)$query_idx, target_idx = matches(mtches_mode)$target_idx) %>% 
  select(query_idx, target_idx, name, msLevel, polarity, rtime, dataOrigin)

## # A tibble: 8 × 7
##   query_idx target_idx name            msLevel polarity rtime dataOrigin        
##       <int>      <int> <chr>             <int>    <int> <dbl> <chr>             
## 1         1          1 LPC(16:0)             1        1  286. /media/hostsource…
## 2         1          2 LPC(16:0)             2        1  286. /media/hostsource…
## 3         2        419 L-Phenylalanine       1        1  110. /media/hostsource…
## 4         2        420 L-Phenylalanine       2        1  110. /media/hostsource…
## 5         2        632 L-Phenylalanine       1        1  110. /media/hostsource…
## 6         3        419 L-Phenylalanine       1        1  110. /media/hostsource…
## 7         3        420 L-Phenylalanine       2        1  110. /media/hostsource…
## 8         3        632 L-Phenylalanine       1        1  110. /media/hostsource…

You can also get more details about the match:

matches(mtches_mode)

##    query_idx target_idx     score reverse_score presence_ratio
## 1          1          1 0.9435464     0.9435464      0.1607143
## 2          1          2 0.8070411     0.8070411      0.2195122
## 12         2        419 0.9457657     0.9457657      0.2608696
## 13         2        420 0.8498953     0.8870301      0.3333333
## 14         2        632 0.9524082     0.9642384      0.4545455
## 16         3        419 0.8699097     0.8699097      0.3478261
## 17         3        420 0.9692363     0.9692363      0.5333333
## 18         3        632 0.8763451     0.8835292      0.6363636
##    matched_peaks_count
## 1                    9
## 2                    9
## 12                   6
## 13                   5
## 14                   5
## 16                   8
## 17                   8
## 18                   7

Matching just a mass to all spectra

Can we do it also if we just had a single mz we wanted to look-up? Yes!

Lets imagine we simply want to match this list:

spectra_from_excel_mz_only <- spectra_from_excel %>% 
                                  slice(c(2, 15, 21)) %>% 
                                  select(ionmode, group, mz)

spectra_from_excel_mz_only

## # A tibble: 3 × 3
##   ionmode group    mz
##   <chr>   <dbl> <dbl>
## 1 pos       111  496.
## 2 pos       980  166.
## 3 pos       981  166.

We again can “fake” a Spectra object.

spectra_from_excel_mz_only_as_spectra <- spectra_from_excel_mz_only %>% 
                                            mutate(intensity = 100) %>% # it cannot be empty to we just put something
                                            chop(c(mz, intensity)) %>% 
                                            Spectra

spectra_from_excel_mz_only_as_spectra

## MSn data (Spectra) with 3 spectra in a MsBackendMemory backend:
##     msLevel     rtime scanIndex
##   <integer> <numeric> <integer>
## 1        NA        NA        NA
## 2        NA        NA        NA
## 3        NA        NA        NA
##  ... 18 more variables/columns.

We make the parameters to match by. This time we require only the mass to match no matter the score or RT.

mp <- MatchForwardReverseParam(ppm = 30, 
                               requirePrecursor = FALSE, 
                               THRESHFUN = function(x) which(x > 0), 
                               THRESHFUN_REVERSE = function(x) which(x >= 0), # so all matches are kept
                               toleranceRt = Inf
                               )

We do the matching:

mtches <- matchSpectra(spectra_from_excel_mz_only_as_spectra, DB_selected, param = mp)

mtches_mode <- filterMisMatchedPolarity(mtches, query_mode_col = "ionmode", target_mode_col = "polarity")

mtches_mode

## Object of class MatchedSpectra 
## Total number of matches: 44 
## Number of query objects: 3 (3 matched)
## Number of target objects: 1435 (24 matched)

This time we get 44 matches. So obviously some are wrong…

Now lets see what we found:

DB_selected %>% 
  spectraData %>%
  as_tibble %>% 
  slice(matches(mtches_mode)$target_idx) %>%
  mutate(query_idx = matches(mtches_mode)$query_idx, target_idx = matches(mtches_mode)$target_idx) %>% 
  select(query_idx, target_idx, name, msLevel, polarity, rtime, dataOrigin)

## # A tibble: 44 × 7
##    query_idx target_idx name             msLevel polarity rtime dataOrigin      
##        <int>      <int> <chr>              <int>    <int> <dbl> <chr>           
##  1         1          1 LPC(16:0)              1        1 286.  /media/hostsour…
##  2         1          2 LPC(16:0)              2        1 286.  /media/hostsour…
##  3         1          9 LPC(17:1)              1        1 284.  /media/hostsour…
##  4         1        744 LPC (18:1)             1        1 288.  /media/hostsour…
##  5         2        161 stachydrine            1        1  36.5 /media/hostsour…
##  6         2        295 α-Methyl-L-DOPA        1        1  62.5 /media/hostsour…
##  7         2        319 Asp-Phe                1        1 163.  /media/hostsour…
##  8         2        320 Asp-Phe                2        1 163.  /media/hostsour…
##  9         2        414 Pantothenic acid       1        1 189.  /media/hostsour…
## 10         2        415 Pantothenic acid       1        1 131.  /media/hostsour…
## # ℹ 34 more rows

You can also get more details about the match:

matches(mtches_mode)

##    query_idx target_idx        score reverse_score presence_ratio
## 1          1          1 0.6252763335  0.6252763335     0.01785714
## 2          1          2 0.4183212537  0.4183212537     0.02439024
## 3          1          9 0.0003687644  0.0003687644     0.00800000
## 4          1        744 0.0001975497  0.0001975497     0.00800000
## 7          2        161 0.0083635038  0.0083635038     0.09090909
## 8          2        295 0.3898961889  0.3898961889     0.04347826
## 9          2        319 0.2309524283  0.2309524283     0.02222222
## 10         2        320 0.0338301984  0.0338301984     0.02857143
## 11         2        414 0.0108036658  0.0108036658     0.04545455
## 12         2        415 0.0063866429  0.0063866429     0.02439024
## 13         2        419 0.0233418240  0.0233418240     0.04347826
## 14         2        420 0.0284218862  0.0284218862     0.06666667
## 15         2        570 0.3981241686  0.3981241686     0.05000000
## 16         2        583 0.2381783831  0.2381783831     0.02173913
## 17         2        632 0.0250502963  0.0250502963     0.09090909
## 18         2        877 0.1973823876  0.1973823876     0.05000000
## 19         2        878 0.1398388075  0.1398388075     0.08333333
## 20         2        955 0.0042252548  0.0042252548     0.01265823
## 21         2       1003 0.1982663747  0.1982663747     0.03846154
## 22         2       1004 0.0987288210  0.0987288210     0.09090909
## 23         2       1333 0.0695714147  0.0695714147     0.04761905
## 24         2       1334 0.0360922985  0.0360922985     0.10000000
## 25         2       1358 0.0481780181  0.0481780181     0.01754386
## 26         2       1359 0.0488774586  0.0488774586     0.02857143
## 31         3        161 0.0083635038  0.0083635038     0.09090909
## 32         3        295 0.3898961889  0.3898961889     0.04347826
## 33         3        319 0.2309524283  0.2309524283     0.02222222
## 34         3        320 0.0338301984  0.0338301984     0.02857143
## 35         3        414 0.0108036658  0.0108036658     0.04545455
## 36         3        415 0.0063866429  0.0063866429     0.02439024
## 37         3        419 0.0233418240  0.0233418240     0.04347826
## 38         3        420 0.0284218862  0.0284218862     0.06666667
## 39         3        570 0.3981241686  0.3981241686     0.05000000
## 40         3        583 0.2381783831  0.2381783831     0.02173913
## 41         3        632 0.0250502963  0.0250502963     0.09090909
## 42         3        877 0.1973823876  0.1973823876     0.05000000
## 43         3        878 0.1398388075  0.1398388075     0.08333333
## 44         3        955 0.0042252548  0.0042252548     0.01265823
## 45         3       1003 0.1982663747  0.1982663747     0.03846154
## 46         3       1004 0.0987288210  0.0987288210     0.09090909
## 47         3       1333 0.0695714147  0.0695714147     0.04761905
## 48         3       1334 0.0360922985  0.0360922985     0.10000000
## 49         3       1358 0.0481780181  0.0481780181     0.01754386
## 50         3       1359 0.0488774586  0.0488774586     0.02857143
##    matched_peaks_count
## 1                    1
## 2                    1
## 3                    1
## 4                    1
## 7                    1
## 8                    1
## 9                    1
## 10                   1
## 11                   1
## 12                   1
## 13                   1
## 14                   1
## 15                   1
## 16                   1
## 17                   1
## 18                   1
## 19                   1
## 20                   1
## 21                   1
## 22                   1
## 23                   1
## 24                   1
## 25                   1
## 26                   1
## 31                   1
## 32                   1
## 33                   1
## 34                   1
## 35                   1
## 36                   1
## 37                   1
## 38                   1
## 39                   1
## 40                   1
## 41                   1
## 42                   1
## 43                   1
## 44                   1
## 45                   1
## 46                   1
## 47                   1
## 48                   1
## 49                   1
## 50                   1

OK, this is really only useful if you are really desperate. Way to many spurious matches to be generally useful.

But what if we try to remove really small peaks? We can do that by removing peaks below a certain intensity (here we cut everything below 5% of the highest peak):

above_rel_thres <- function(x, thres = 5) {
    100*x/max(x, na.rm = TRUE) >= thres
}

DB_selected_norm_cut <-  filterIntensity(DB_selected, above_rel_thres, thres = 5)

And then run the matching again.

mp <- MatchForwardReverseParam(ppm = 30, 
                               requirePrecursor = FALSE, 
                               THRESHFUN = function(x) which(x > 0), 
                               THRESHFUN_REVERSE = function(x) which(x >= 0), # so all matches are kept
                               toleranceRt = Inf
                               )


mtches <- matchSpectra(spectra_from_excel_mz_only_as_spectra, DB_selected_norm_cut, param = mp)

mtches_mode <- filterMisMatchedPolarity(mtches, query_mode_col = "ionmode", target_mode_col = "polarity")

mtches_mode

## Object of class MatchedSpectra 
## Total number of matches: 30 
## Number of query objects: 3 (3 matched)
## Number of target objects: 1435 (16 matched)

DB_selected_norm_cut %>% 
  spectraData %>%
  as_tibble %>% 
  slice(matches(mtches_mode)$target_idx) %>%
  mutate(query_idx = matches(mtches_mode)$query_idx, 
         target_idx = matches(mtches_mode)$target_idx
         ) %>% 
  select(query_idx, target_idx, name, msLevel, polarity, rtime, dataOrigin)

## # A tibble: 30 × 7
##    query_idx target_idx name            msLevel polarity rtime dataOrigin       
##        <int>      <int> <chr>             <int>    <int> <dbl> <chr>            
##  1         1          1 LPC(16:0)             1        1 286.  /media/hostsourc…
##  2         1          2 LPC(16:0)             2        1 286.  /media/hostsourc…
##  3         2        295 α-Methyl-L-DOPA       1        1  62.5 /media/hostsourc…
##  4         2        319 Asp-Phe               1        1 163.  /media/hostsourc…
##  5         2        320 Asp-Phe               2        1 163.  /media/hostsourc…
##  6         2        570 α-Methyl-L-DOPA       1        1  62.5 /media/hostsourc…
##  7         2        583 Asp-Phe               1        1 163.  /media/hostsourc…
##  8         2        877 Val-Phe               1        1 192.  /media/hostsourc…
##  9         2        878 Val-Phe               2        1 192.  /media/hostsourc…
## 10         2        955 Trolox                2        1 244.  /media/hostsourc…
## # ℹ 20 more rows

So yes, we got fewer hits… But we lost phenylalanine because the intensity is actually very low!

For something that might be more sensible check the next section.

Matching just a mass to all parent masses

So instead if matching against all peaks in all spectra we might want to just check against parent masses. We can do that my using the metadata of the database and calculating the expected single charged mz.

DB_selected$mz_calc <- DB_selected$computed_exact_mass_largest_neutral + 
                       c(-1.0073, 1.0073)[DB_selected$polarity+1]

Then we can simply match by mz of the parent:

param <- MzParam(tolerance = 0, ppm = 30)
matches_mz_only <- matchValues(as.data.frame(spectra_from_excel_mz_only), 
                               DB_selected, param, 
                               mzColname = c("mz", "mz_calc")
                               )

Lets see what we hit this time:

DB_selected %>% 
  spectraData %>%
  as_tibble %>% 
  slice(matches(matches_mz_only)$target_idx) %>%
  mutate(query_idx = matches(matches_mz_only)$query_idx, 
         target_idx = matches(matches_mz_only)$target_idx
         ) %>% 
  select(query_idx, target_idx, name, msLevel, polarity, rtime, dataOrigin)

## # A tibble: 8 × 7
##   query_idx target_idx name            msLevel polarity rtime dataOrigin        
##       <int>      <int> <chr>             <int>    <int> <dbl> <chr>             
## 1         1          1 LPC(16:0)             1        1  286. /media/hostsource…
## 2         1          2 LPC(16:0)             2        1  286. /media/hostsource…
## 3         2        419 L-Phenylalanine       1        1  110. /media/hostsource…
## 4         2        420 L-Phenylalanine       2        1  110. /media/hostsource…
## 5         2        632 L-Phenylalanine       1        1  110. /media/hostsource…
## 6         3        419 L-Phenylalanine       1        1  110. /media/hostsource…
## 7         3        420 L-Phenylalanine       2        1  110. /media/hostsource…
## 8         3        632 L-Phenylalanine       1        1  110. /media/hostsource…

So we got the same hits as originally.

Jan Stanstrup