pyXLMS.transform package

Submodules

pyXLMS.transform.aggregate module

pyXLMS.transform.aggregate.aggregate(
csms: List[Dict[str, Any]],
by: Literal['peptide', 'protein'] = 'peptide',
score: Literal['higher_better', 'lower_better'] = 'higher_better',
) -> List[Dict[str, Any]]

Aggregate crosslink-spectrum-matches to crosslinks.

Aggregates a list of crosslink-spectrum-matches to unique crosslinks. If by = "peptide", a crosslink is considered unique if there is no other crosslink with the same peptide sequences and crosslink positions. If by = "protein", it is considered unique if there is no other crosslink with the same protein crosslink positions (residue pair). If more than one crosslink exists per peptide sequence/residue pair, the one with the best score is kept and the rest are filtered out. If crosslink-spectrum-matches without scores are provided, the crosslink of the first corresponding crosslink-spectrum-match in the list is kept instead.

Parameters:
  • csms (list of dict of str, any) – A list of crosslink-spectrum-matches.

  • by (str, one of "peptide" or "protein", default = "peptide") – Whether the peptide or the protein crosslink position is used to determine if a crosslink is unique. If by = "protein" and the protein crosslink position is not available for all crosslink-spectrum-matches, a ValueError is raised. Make sure that all crosslink-spectrum-matches have the _proteins and _proteins_crosslink_positions fields set. If the parser has not already set them, use transform.reannotate_positions().

  • score (str, one of "higher_better" or "lower_better", default = "higher_better") – If a higher score is considered better, or a lower score is considered better.

Returns:

A list of aggregated, unique crosslinks.

Return type:

list of dict of str, any

Warning

Aggregation will not conserve false discovery rate (FDR)! Aggregating crosslink-spectrum-matches that are validated for 1% FDR will not result in crosslinks validated for 1% FDR! Aggregated crosslinks should be validated with either external tools or with the built-in transform.validate()!

Raises:
  • TypeError – If a wrong data type is provided.

  • TypeError – If parameter by is not one of ‘peptide’ or ‘protein’.

  • TypeError – If parameter score is not one of ‘higher_better’ or ‘lower_better’.

  • ValueError – If parameter by is set to ‘protein’ but protein crosslink positions are not available.

Examples

>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import aggregate
>>> pr = read("data/_test/aggregate/csms.txt", engine="custom", crosslinker="DSS")
>>> len(pr["crosslink-spectrum-matches"])
10
>>> aggregate_peptide = aggregate(pr["crosslink-spectrum-matches"], by="peptide")
>>> len(aggregate_peptide)
3
>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import aggregate
>>> pr = read("data/_test/aggregate/csms.txt", engine="custom", crosslinker="DSS")
>>> len(pr["crosslink-spectrum-matches"])
10
>>> aggregate_protein = aggregate(pr["crosslink-spectrum-matches"], by="protein")
>>> len(aggregate_protein)
2
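
The selection rule described above (group by peptide pair and crosslink position, keep the best score, fall back to the first match when scores are missing) can be sketched in plain Python. This is a minimal illustration and not the library's implementation: aggregate_sketch and the example records are hypothetical, only the by = "peptide" case is covered, and the sketch returns the winning match dictionaries instead of converting them to crosslink objects.

```python
from typing import Any, Dict, List, Tuple


def aggregate_sketch(
    csms: List[Dict[str, Any]], higher_better: bool = True
) -> List[Dict[str, Any]]:
    """Keep one best-scoring match per peptide pair and crosslink position."""
    best: Dict[Tuple[Any, ...], Dict[str, Any]] = {}
    for csm in csms:
        key = (
            csm["alpha_peptide"],
            csm["alpha_peptide_crosslink_position"],
            csm["beta_peptide"],
            csm["beta_peptide_crosslink_position"],
        )
        kept = best.get(key)
        if kept is None:
            best[key] = csm  # first match seen for this key
            continue
        if csm.get("score") is None or kept.get("score") is None:
            continue  # without scores the first match stays
        if higher_better:
            is_better = csm["score"] > kept["score"]
        else:
            is_better = csm["score"] < kept["score"]
        if is_better:
            best[key] = csm
    return list(best.values())


def make_match(alpha, pos_a, beta, pos_b, score):
    """Hypothetical helper that builds a minimal record for the demo."""
    return {
        "alpha_peptide": alpha,
        "alpha_peptide_crosslink_position": pos_a,
        "beta_peptide": beta,
        "beta_peptide_crosslink_position": pos_b,
        "score": score,
    }


example_csms = [
    make_match("KPEP", 1, "PEKP", 3, 10.0),
    make_match("KPEP", 1, "PEKP", 3, 25.0),
    make_match("PEPK", 4, "PKEP", 2, 5.0),
]
aggregated = aggregate_sketch(example_csms)
print(len(aggregated))
```

Because a plain dictionary preserves insertion order, the result keeps the order in which each unique crosslink was first encountered.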
pyXLMS.transform.aggregate.unique(
data: List[Dict[str, Any]] | Dict[str, Any],
by: Literal['peptide', 'protein'] = 'peptide',
score: Literal['higher_better', 'lower_better'] = 'higher_better',
) -> List[Dict[str, Any]] | Dict[str, Any]

Filter for unique crosslinks or crosslink-spectrum-matches.

Filters for unique crosslinks from a list of non-unique crosslinks. If by = "peptide", a crosslink is considered unique if there is no other crosslink with the same peptide sequences and crosslink positions. If by = "protein", it is considered unique if there is no other crosslink with the same protein crosslink positions (residue pair). If more than one crosslink exists per peptide sequence/residue pair, the one with the best score is kept and the rest are filtered out. If crosslinks without scores are provided, the first crosslink in the list is kept instead.

or

Filters for unique crosslink-spectrum-matches from a list of non-unique crosslink-spectrum-matches. A crosslink-spectrum-match is considered unique if there is no other crosslink-spectrum-match from the same spectrum file and with the same scan number. If more than one crosslink-spectrum-match exists per spectrum file and scan number, the one with the best score is kept and the rest are filtered out. If crosslink-spectrum-matches without scores are provided, the first crosslink-spectrum-match in the list is kept instead.

Parameters:
  • data (dict of str, any, or list of dict of str, any) – A list of crosslink-spectrum-matches or crosslinks to filter, or a parser_result.

  • by (str, one of "peptide" or "protein", default = "peptide") – If peptide or protein crosslink position should be used for determining if a crosslink is unique. Only affects filtering for unique crosslinks and not crosslink-spectrum-matches. If protein crosslink position is not available for all crosslinks a ValueError will be raised. Make sure that all crosslinks have the _proteins and _proteins_crosslink_positions fields set. If this is not already done by the parser, this can be achieved with transform.reannotate_positions().

  • score (str, one of "higher_better" or "lower_better", default = "higher_better") – If a higher score is considered better, or a lower score is considered better.

Returns:

If a list of crosslink-spectrum-matches or crosslinks was provided, a list of unique crosslink-spectrum-matches or crosslinks is returned. If a parser_result was provided, a parser_result with unique crosslink-spectrum-matches and/or unique crosslinks will be returned.

Return type:

list of dict of str, any, or dict of str, any

Raises:
  • TypeError – If a wrong data type is provided.

  • TypeError – If parameter by is not one of ‘peptide’ or ‘protein’.

  • TypeError – If parameter score is not one of ‘higher_better’ or ‘lower_better’.

  • ValueError – If parameter by is set to ‘protein’ but protein crosslink positions are not available.

Examples

>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import unique
>>> pr = read(["data/_test/aggregate/csms.txt", "data/_test/aggregate/xls.txt"], engine="custom", crosslinker="DSS")
>>> len(pr["crosslink-spectrum-matches"])
10
>>> len(pr["crosslinks"])
10
>>> unique_peptide = unique(pr, by="peptide")
>>> len(unique_peptide["crosslink-spectrum-matches"])
5
>>> len(unique_peptide["crosslinks"])
3
>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import unique
>>> pr = read(["data/_test/aggregate/csms.txt", "data/_test/aggregate/xls.txt"], engine="custom", crosslinker="DSS")
>>> len(pr["crosslink-spectrum-matches"])
10
>>> len(pr["crosslinks"])
10
>>> unique_protein = unique(pr, by="protein")
>>> len(unique_protein["crosslink-spectrum-matches"])
5
>>> len(unique_protein["crosslinks"])
2
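
For crosslink-spectrum-matches, uniqueness is defined per spectrum file and scan number. One way to express that rule is to rank the matches by score and keep the first match seen for each key. The sketch below is only an illustration under the assumption that every match has a score; unique_csms_sketch and the example records are hypothetical, and the spectrum_file, scan_nr and score keys are taken from the key list documented for get_available_keys.

```python
from typing import Any, Dict, List, Tuple


def unique_csms_sketch(
    csms: List[Dict[str, Any]], higher_better: bool = True
) -> List[Dict[str, Any]]:
    """Keep the best-scoring match per (spectrum_file, scan_nr)."""
    # sorted() is stable, so equally scored matches keep their input order.
    ranked = sorted(csms, key=lambda c: c["score"], reverse=higher_better)
    seen: Dict[Tuple[str, int], Dict[str, Any]] = {}
    for csm in ranked:
        # setdefault only stores the first (best ranked) match per key.
        seen.setdefault((csm["spectrum_file"], csm["scan_nr"]), csm)
    return list(seen.values())


example_csms = [
    {"spectrum_file": "RUN_1", "scan_nr": 1, "score": 3.0},
    {"spectrum_file": "RUN_1", "scan_nr": 1, "score": 7.0},
    {"spectrum_file": "RUN_1", "scan_nr": 2, "score": 4.0},
]
unique_csms = unique_csms_sketch(example_csms)
print([c["score"] for c in unique_csms])
```

Unlike the documented behaviour, this sketch returns the matches in ranked order and does not handle missing scores.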

pyXLMS.transform.filter module

pyXLMS.transform.filter.filter_crosslink_type(
data: List[Dict[str, Any]],
)

Separate crosslinks and crosslink-spectrum-matches by their crosslink type.

Separates crosslinks or crosslink-spectrum-matches into two groups, depending on whether their crosslink type is “intra” or “inter”.

Parameters:

data (list of dict of str, any) – A list of pyXLMS crosslinks or crosslink-spectrum-matches.

Returns:

Returns a dictionary with key Intra which contains all crosslinks or crosslink-spectrum-matches with crosslink type = “intra”, and key Inter which contains all crosslinks or crosslink-spectrum-matches with crosslink type = “inter”.

Return type:

dict of str, list of dict

Raises:

TypeError – If an unsupported data type is provided.

Examples

>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import filter_crosslink_type
>>> result = read("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_CSMs.xlsx", engine="MS Annika", crosslinker="DSS")
>>> crosslink_type_filtered_csms = filter_crosslink_type(result["crosslink-spectrum-matches"])
>>> len(crosslink_type_filtered_csms["Intra"])
803
>>> len(crosslink_type_filtered_csms["Inter"])
23
>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import filter_crosslink_type
>>> result = read("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_Crosslinks.xlsx", engine="MS Annika", crosslinker="DSS")
>>> crosslink_type_filtered_crosslinks = filter_crosslink_type(result["crosslinks"])
>>> len(crosslink_type_filtered_crosslinks["Intra"])
279
>>> len(crosslink_type_filtered_crosslinks["Inter"])
21
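
The returned structure can be pictured with a small stand-alone sketch. It assumes that each record carries a crosslink_type field with the value "intra" or "inter", as listed among the documented keys; split_by_crosslink_type is a hypothetical name and not part of pyXLMS.

```python
from typing import Any, Dict, List


def split_by_crosslink_type(
    data: List[Dict[str, Any]],
) -> Dict[str, List[Dict[str, Any]]]:
    """Group records under the keys 'Intra' and 'Inter'."""
    result: Dict[str, List[Dict[str, Any]]] = {"Intra": [], "Inter": []}
    for item in data:
        if item["crosslink_type"] == "intra":
            result["Intra"].append(item)
        elif item["crosslink_type"] == "inter":
            result["Inter"].append(item)
    return result


example = [
    {"crosslink_type": "intra", "score": 12.0},
    {"crosslink_type": "inter", "score": 8.0},
    {"crosslink_type": "intra", "score": 3.0},
]
split = split_by_crosslink_type(example)
print(len(split["Intra"]), len(split["Inter"]))
```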
pyXLMS.transform.filter.filter_proteins(
data: List[Dict[str, Any]],
proteins: Set[str] | List[str],
) -> Dict[str, List[Any]]

Get all crosslinks or crosslink-spectrum-matches originating from proteins of interest.

Gets all crosslinks or crosslink-spectrum-matches originating from a list of proteins of interest and returns a list of crosslinks or crosslink-spectrum-matches where both peptides come from a protein of interest and a list of crosslinks or crosslink-spectrum-matches where one of the peptides comes from a protein of interest.

Parameters:
  • data (list of dict of str, any) – A list of pyXLMS crosslinks or crosslink-spectrum-matches.

  • proteins (set of str, or list of str) – A set of protein accessions of interest.

Returns:

Returns a dictionary with key Proteins which contains the list of proteins of interest, key Both which contains all crosslinks or crosslink-spectrum-matches where both peptides are originating from a protein of interest, and key One which contains all crosslinks or crosslink-spectrum-matches where one of the two peptides is originating from a protein of interest.

Return type:

dict

Raises:

TypeError – If an unsupported data type is provided.

Examples

>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import filter_proteins
>>> result = read("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_CSMs.xlsx", engine="MS Annika", crosslinker="DSS")
>>> proteins_csms = filter_proteins(result["crosslink-spectrum-matches"], ["Cas9"])
>>> proteins_csms["Proteins"]
['Cas9']
>>> len(proteins_csms["Both"])
798
>>> len(proteins_csms["One"])
23
>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import filter_proteins
>>> result = read("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_Crosslinks.xlsx", engine="MS Annika", crosslinker="DSS")
>>> proteins_xls = filter_proteins(result["crosslinks"], ["Cas9"])
>>> proteins_xls["Proteins"]
['Cas9']
>>> len(proteins_xls["Both"])
274
>>> len(proteins_xls["One"])
21
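
The distinction between Both and One can be sketched as follows. This is an illustration only: filter_proteins_sketch is a hypothetical name, and it assumes that a peptide counts as originating from a protein of interest when any accession in its alpha_proteins or beta_proteins list is in the requested set.

```python
from typing import Any, Dict, Iterable, List


def filter_proteins_sketch(
    data: List[Dict[str, Any]], proteins: Iterable[str]
) -> Dict[str, List[Any]]:
    """Split records by how many peptides map to a protein of interest."""
    wanted = set(proteins)
    both: List[Any] = []
    one: List[Any] = []
    for item in data:
        # Assumption: a missing protein annotation never counts as a hit.
        alpha_hit = bool(wanted & set(item.get("alpha_proteins") or []))
        beta_hit = bool(wanted & set(item.get("beta_proteins") or []))
        if alpha_hit and beta_hit:
            both.append(item)
        elif alpha_hit or beta_hit:
            one.append(item)
    return {"Proteins": sorted(wanted), "Both": both, "One": one}


example = [
    {"alpha_proteins": ["Cas9"], "beta_proteins": ["Cas9"]},
    {"alpha_proteins": ["Cas9"], "beta_proteins": ["P12345"]},
    {"alpha_proteins": ["P12345"], "beta_proteins": ["P12345"]},
]
filtered = filter_proteins_sketch(example, ["Cas9"])
print(len(filtered["Both"]), len(filtered["One"]))
```

Records where neither peptide maps to a protein of interest appear in neither list.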
pyXLMS.transform.filter.filter_target_decoy(
data: List[Dict[str, Any]],
) -> Dict[str, List[Dict[str, Any]]]

Separate crosslinks or crosslink-spectrum-matches based on target and decoy matches.

Separates crosslinks or crosslink-spectrum-matches based on whether both peptides match the target database, both match the decoy database, or one matches the target database and the other the decoy database. The first are denoted “Target-Target” or “TT” matches, the second “Decoy-Decoy” or “DD” matches, and the third “Target-Decoy” or “TD” matches.

Parameters:

data (list of dict of str, any) – A list of pyXLMS crosslinks or crosslink-spectrum-matches.

Returns:

Returns a dictionary with key Target-Target which contains all TT matches, key Target-Decoy which contains all TD matches, and key Decoy-Decoy which contains all DD matches.

Return type:

dict

Raises:

TypeError – If an unsupported data type is provided.

Examples

>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import filter_target_decoy
>>> result = read("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_CSMs.xlsx", engine="MS Annika", crosslinker="DSS")
>>> target_and_decoys = filter_target_decoy(result["crosslink-spectrum-matches"])
>>> len(target_and_decoys["Target-Target"])
786
>>> len(target_and_decoys["Target-Decoy"])
39
>>> len(target_and_decoys["Decoy-Decoy"])
1
>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import filter_target_decoy
>>> result = read("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_Crosslinks.xlsx", engine="MS Annika", crosslinker="DSS")
>>> target_and_decoys = filter_target_decoy(result["crosslinks"])
>>> len(target_and_decoys["Target-Target"])
265
>>> len(target_and_decoys["Target-Decoy"])
0
>>> len(target_and_decoys["Decoy-Decoy"])
35
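
The three groups follow directly from the number of decoy peptides in a match. A minimal sketch, assuming that every record has boolean alpha_decoy and beta_decoy fields (split_target_decoy is a hypothetical name):

```python
from typing import Any, Dict, List

GROUPS = ("Target-Target", "Target-Decoy", "Decoy-Decoy")


def split_target_decoy(
    data: List[Dict[str, Any]],
) -> Dict[str, List[Dict[str, Any]]]:
    """Group records by how many of their two peptides are decoys."""
    result: Dict[str, List[Dict[str, Any]]] = {name: [] for name in GROUPS}
    for item in data:
        # 0 decoys -> TT, 1 decoy -> TD (either peptide), 2 decoys -> DD
        n_decoys = int(bool(item["alpha_decoy"])) + int(bool(item["beta_decoy"]))
        result[GROUPS[n_decoys]].append(item)
    return result


example = [
    {"alpha_decoy": False, "beta_decoy": False},
    {"alpha_decoy": True, "beta_decoy": False},
    {"alpha_decoy": False, "beta_decoy": True},
    {"alpha_decoy": True, "beta_decoy": True},
]
groups = split_target_decoy(example)
print({name: len(items) for name, items in groups.items()})
```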

pyXLMS.transform.reannotate_positions module

pyXLMS.transform.reannotate_positions.fasta_title_to_accession(title: str) -> str

Parses the protein accession from a UniProt-like title.

Parameters:

title (str) – Fasta title/header.

Returns:

The protein accession parsed from the title. If parsing was unsuccessful the full title is returned.

Return type:

str

Examples

>>> from pyXLMS.transform import fasta_title_to_accession
>>> title = "sp|A0A087X1C5|CP2D7_HUMAN Putative cytochrome P450 2D7 OS=Homo sapiens OX=9606 GN=CYP2D7 PE=5 SV=1"
>>> fasta_title_to_accession(title)
'A0A087X1C5'
>>> from pyXLMS.transform import fasta_title_to_accession
>>> title = "Cas9"
>>> fasta_title_to_accession(title)
'Cas9'
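
A function with the same contract (return the accession if the title looks UniProt-like, otherwise return the full title) could look like the sketch below. The regular expression is an assumption for illustration and may differ from the pattern pyXLMS actually uses; a callable of this shape is what the title_to_accession parameter of reannotate_positions() accepts according to its signature.

```python
import re

# Assumed pattern: "sp|ACCESSION|ENTRY_NAME ..." or "tr|ACCESSION|ENTRY_NAME ..."
UNIPROT_TITLE = re.compile(r"^(?:sp|tr)\|([^|\s]+)\|")


def title_to_accession_sketch(title: str) -> str:
    """Return the accession of a UniProt-like title, else the title itself."""
    cleaned = title.strip().lstrip(">")
    match = UNIPROT_TITLE.match(cleaned)
    return match.group(1) if match else cleaned


uniprot_title = (
    "sp|A0A087X1C5|CP2D7_HUMAN Putative cytochrome P450 2D7 "
    "OS=Homo sapiens OX=9606 GN=CYP2D7 PE=5 SV=1"
)
print(title_to_accession_sketch(uniprot_title))
print(title_to_accession_sketch("Cas9"))
```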
pyXLMS.transform.reannotate_positions.reannotate_positions(
data: List[Dict[str, Any]] | Dict[str, Any],
fasta: str | BinaryIO,
title_to_accession: Callable[[str], str] | None = None,
) -> List[Dict[str, Any]] | Dict[str, Any]

Reannotates protein crosslink positions for a given fasta file.

Reannotates the protein crosslink positions and peptide positions of the given cross-linked peptide pairs using the protein sequences in the specified fasta file. Takes a list of crosslink-spectrum-matches or crosslinks, or a parser_result as input.

Parameters:
  • data (list of dict of str, any, or dict of str, any) – A list of crosslink-spectrum-matches or crosslinks to annotate, or a parser_result.

  • fasta (str, or file stream) – The name/path of the fasta file containing protein sequences or a file-like object/stream.

  • title_to_accession (callable, or None, default = None) – A function that parses the protein accession from the fasta title/header. If None (default) the function fasta_title_to_accession is used.

Returns:

If a list of crosslink-spectrum-matches or crosslinks was provided, a list of annotated crosslink-spectrum-matches or crosslinks is returned. If a parser_result was provided, an annotated parser_result will be returned.

Return type:

list of dict of str, any, or dict of str, any

Raises:

TypeError – If a wrong data type is provided.

Examples

>>> from pyXLMS.data import create_crosslink_min
>>> from pyXLMS.transform import reannotate_positions
>>> xls = [create_crosslink_min("ADANLDK", 7, "GNTDRHSIK", 9)]
>>> xls = reannotate_positions(xls, "data/_fasta/Cas9_plus10.fasta")
>>> xls[0]["alpha_proteins"]
['Cas9']
>>> xls[0]["alpha_proteins_crosslink_positions"]
[1293]
>>> xls[0]["beta_proteins"]
['Cas9']
>>> xls[0]["beta_proteins_crosslink_positions"]
[48]
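
Conceptually, a protein crosslink position is the 1-based position of the crosslinked residue within the protein sequence. The following sketch shows one plausible way to derive it from a peptide's location in the sequence. It is not the library's algorithm: protein_crosslink_positions is a hypothetical helper, and the short sequence is made up for the demonstration.

```python
from typing import List


def protein_crosslink_positions(
    peptide: str, crosslink_position: int, protein_sequence: str
) -> List[int]:
    """Return the 1-based protein position for every peptide occurrence."""
    positions: List[int] = []
    start = protein_sequence.find(peptide)
    while start != -1:
        # start is 0-based, crosslink_position is 1-based within the peptide,
        # so their sum is the 1-based position within the protein.
        positions.append(start + crosslink_position)
        start = protein_sequence.find(peptide, start + 1)
    return positions


# Made-up sequence that contains the peptide ADANLDK once.
toy_sequence = "MKADANLDKGG"
toy_positions = protein_crosslink_positions("ADANLDK", 7, toy_sequence)
print(toy_positions)
```

A peptide that occurs several times in a sequence yields several positions, which is one reason the position fields hold lists.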

pyXLMS.transform.summary module

pyXLMS.transform.summary.summary(
data: List[Dict[str, Any]] | Dict[str, Any],
) -> Dict[str, float]

Extracts summary stats from a list of crosslinks or crosslink-spectrum-matches, or a parser_result.

Extracts summary statistics from a list of crosslinks or crosslink-spectrum-matches, or a parser_result. The statistics depend on the supplied data type. If a list of crosslinks is supplied, a dictionary with the following statistics and keys is returned:

  • Number of crosslinks

  • Number of unique crosslinks by peptide

  • Number of unique crosslinks by protein

  • Number of intra crosslinks

  • Number of inter crosslinks

  • Number of target-target crosslinks

  • Number of target-decoy crosslinks

  • Number of decoy-decoy crosslinks

  • Minimum crosslink score

  • Maximum crosslink score

If a list of crosslink-spectrum-matches is supplied, a dictionary with the following statistics and keys is returned:

  • Number of CSMs

  • Number of unique CSMs

  • Number of intra CSMs

  • Number of inter CSMs

  • Number of target-target CSMs

  • Number of target-decoy CSMs

  • Number of decoy-decoy CSMs

  • Minimum CSM score

  • Maximum CSM score

If a parser_result is supplied, the returned dictionary contains the statistics for whichever data is available. A parser_result that only contains crosslinks yields a dictionary with crosslink statistics only, and a parser_result that only contains crosslink-spectrum-matches yields a dictionary with crosslink-spectrum-match statistics only. If the parser_result contains both, the two dictionaries are merged. Please note that in this case a single dictionary is returned that contains the keys for both crosslinks and crosslink-spectrum-matches.

Statistics are also printed to stdout.

Parameters:

data (list of dict of str, any, or dict of str, any) – A list of crosslinks or crosslink-spectrum-matches, or a parser_result.

Returns:

A dictionary with summary statistics.

Return type:

dict of str, float

Examples

>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import summary
>>> pr = read("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_CSMs.xlsx", engine="MS Annika", crosslinker="DSS")
>>> csms = pr["crosslink-spectrum-matches"]
>>> stats = summary(csms)
Number of CSMs: 826.0
Number of unique CSMs: 826.0
Number of intra CSMs: 803.0
Number of inter CSMs: 23.0
Number of target-target CSMs: 786.0
Number of target-decoy CSMs: 39.0
Number of decoy-decoy CSMs: 1.0
Minimum CSM score: 1.11
Maximum CSM score: 452.99
>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import summary
>>> pr = read("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_Crosslinks.xlsx", engine="MS Annika", crosslinker="DSS")
>>> stats = summary(pr)
Number of crosslinks: 300.0
Number of unique crosslinks by peptide: 300.0
Number of unique crosslinks by protein: 298.0
Number of intra crosslinks: 279.0
Number of inter crosslinks: 21.0
Number of target-target crosslinks: 265.0
Number of target-decoy crosslinks: 0.0
Number of decoy-decoy crosslinks: 35.0
Minimum crosslink score: 1.11
Maximum crosslink score: 452.99

pyXLMS.transform.targets_only module

pyXLMS.transform.targets_only.targets_only(
data: List[Dict[str, Any]] | Dict[str, Any],
) -> List[Dict[str, Any]] | Dict[str, Any]

Get target crosslinks or crosslink-spectrum-matches.

Get target crosslinks or crosslink-spectrum-matches from a list of target and decoy crosslinks or crosslink-spectrum-matches, or a parser_result. This effectively filters out any target-decoy and decoy-decoy matches and is essentially a convenience wrapper for transform.filter_target_decoy()["Target-Target"].

Parameters:

data (dict of str, any, or list of dict of str, any) – A list of crosslink-spectrum-matches or crosslinks, or a parser_result.

Returns:

If a list of crosslink-spectrum-matches or crosslinks was provided, a list of target crosslink-spectrum-matches or crosslinks is returned. If a parser_result was provided, a parser_result with target crosslink-spectrum-matches and/or target crosslinks will be returned.

Return type:

list of dict of str, any, or dict of str, any

Raises:
  • TypeError – If a wrong data type is provided.

  • RuntimeError – If no target crosslinks or crosslink-spectrum-matches were found.

Examples

>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import targets_only
>>> result = read("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_CSMs.xlsx", engine="MS Annika", crosslinker="DSS")
>>> targets = targets_only(result["crosslink-spectrum-matches"])
>>> len(targets)
786
>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import targets_only
>>> result = read("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_Crosslinks.xlsx", engine="MS Annika", crosslinker="DSS")
>>> targets = targets_only(result["crosslinks"])
>>> len(targets)
265
>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import targets_only
>>> result = read(["data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_CSMs.xlsx", "data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_Crosslinks.xlsx"], engine="MS Annika", crosslinker="DSS")
>>> result_targets = targets_only(result)
>>> len(result_targets["crosslink-spectrum-matches"])
786
>>> len(result_targets["crosslinks"])
265

pyXLMS.transform.to_dataframe module

pyXLMS.transform.to_dataframe.to_dataframe(
data: List[Dict[str, Any]],
) -> DataFrame

Returns a pandas DataFrame of the given crosslinks or crosslink-spectrum-matches.

Parameters:

data (list) – A list of crosslinks or crosslink-spectrum-matches as created by data.create_crosslink() or data.create_csm().

Returns:

The pandas DataFrame created from the list of input crosslinks or crosslink-spectrum-matches. A full specification of the returned DataFrame can be found in the docs.

Return type:

pandas.DataFrame

Raises:
  • TypeError – If the list does not contain crosslinks or crosslink-spectrum-matches.

  • ValueError – If the list does not contain any objects.

Examples

>>> from pyXLMS.transform import to_dataframe
>>> # assume that crosslinks is a list of crosslinks created by data.create_crosslink()
>>> crosslink_dataframe = to_dataframe(crosslinks)
>>> # assume csms is a list of crosslink-spectrum-matches created by data.create_csm()
>>> csm_dataframe = to_dataframe(csms)

pyXLMS.transform.to_proforma module

pyXLMS.transform.to_proforma.to_proforma(
data: Dict[str, Any] | List[Dict[str, Any]],
crosslinker: str | float | None = None,
) -> str | List[str]

Returns the Proforma string for a single crosslink or crosslink-spectrum-match, or for a list of crosslinks or crosslink-spectrum-matches.

Parameters:
  • data (dict of str, any, or list of dict of str, any) – A pyXLMS crosslink object, e.g. see data.create_crosslink(). Or a pyXLMS crosslink-spectrum-match object, e.g. see data.create_csm(). Alternatively, a list of crosslinks or crosslink-spectrum-matches can also be provided.

  • crosslinker (str, or float, or None, default = None) – Optional name or mass of the crosslink reagent. If the name is given, it should be a valid name from XLMOD. If the crosslink modification is contained in the crosslink-spectrum-match object this parameter has no effect.

Returns:

The Proforma string of the crosslink or crosslink-spectrum-match. If a list was provided a list containing all Proforma strings is returned.

Return type:

str, or list of str

Raises:

TypeError – If an unsupported data type is provided.

Notes

  • Modifications with unknown mass are skipped.

  • If no modifications are given, only the crosslink modification will be encoded in the Proforma.

  • If no modifications are given and no crosslinker is given, the unmodified peptide Proforma will be returned.

Examples

>>> from pyXLMS.data import create_crosslink_min
>>> from pyXLMS.transform import to_proforma
>>> xl = create_crosslink_min("PEPKTIDE", 4, "KPEPTIDE", 1)
>>> to_proforma(xl)
'KPEPTIDE//PEPKTIDE'
>>> from pyXLMS.data import create_crosslink_min
>>> from pyXLMS.transform import to_proforma
>>> xl = create_crosslink_min("PEPKTIDE", 4, "KPEPTIDE", 1)
>>> to_proforma(xl, crosslinker="Xlink:DSSO")
'K[Xlink:DSSO]PEPTIDE//PEPK[Xlink:DSSO]TIDE'
>>> from pyXLMS.data import create_csm_min
>>> from pyXLMS.transform import to_proforma
>>> csm = create_csm_min("PEPKTIDE", 4, "KPEPTIDE", 1, "RUN_1", 1)
>>> to_proforma(csm)
'KPEPTIDE//PEPKTIDE'
>>> from pyXLMS.data import create_csm_min
>>> from pyXLMS.transform import to_proforma
>>> csm = create_csm_min("PEPKTIDE", 4, "KPEPTIDE", 1, "RUN_1", 1)
>>> to_proforma(csm, crosslinker="Xlink:DSSO")
'K[Xlink:DSSO]PEPTIDE//PEPK[Xlink:DSSO]TIDE'
>>> from pyXLMS.data import create_csm_min
>>> from pyXLMS.transform import to_proforma
>>> csm = create_csm_min("PEPKTIDE", 4, "KPMEPTIDE", 1, "RUN_1", 1, modifications_b={3:("Oxidation", 15.994915)})
>>> to_proforma(csm, crosslinker="Xlink:DSSO")
'K[Xlink:DSSO]PM[+15.994915]EPTIDE//PEPK[Xlink:DSSO]TIDE'
>>> from pyXLMS.data import create_csm_min
>>> from pyXLMS.transform import to_proforma
>>> csm = create_csm_min("PEPKTIDE", 4, "KPMEPTIDE", 1, "RUN_1", 1, modifications_b={3:("Oxidation", 15.994915)}, charge=3)
>>> to_proforma(csm, crosslinker="Xlink:DSSO")
'K[Xlink:DSSO]PM[+15.994915]EPTIDE//PEPK[Xlink:DSSO]TIDE/3'
>>> from pyXLMS.data import create_csm_min
>>> from pyXLMS.transform import to_proforma
>>> csm = create_csm_min("PEPKTIDE", 4, "KPMEPTIDE", 1, "RUN_1", 1, modifications_a={4:("DSSO", 158.00376)}, modifications_b={1:("DSSO", 158.00376), 3:("Oxidation", 15.994915)}, charge=3)
>>> to_proforma(csm)
'K[+158.00376]PM[+15.994915]EPTIDE//PEPK[+158.00376]TIDE/3'
>>> from pyXLMS.data import create_csm_min
>>> from pyXLMS.transform import to_proforma
>>> csm = create_csm_min("PEPKTIDE", 4, "KPMEPTIDE", 1, "RUN_1", 1, modifications_a={4:("DSSO", 158.00376)}, modifications_b={1:("DSSO", 158.00376), 3:("Oxidation", 15.994915)}, charge=3)
>>> to_proforma(csm, crosslinker="Xlink:DSSO")
'K[+158.00376]PM[+15.994915]EPTIDE//PEPK[+158.00376]TIDE/3'
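
The string layout seen in the examples (a bracketed tag after each modified residue, the two peptides joined by //, and an optional /charge suffix) can be reproduced with a few lines of plain Python. This is a sketch of the layout only: both functions are hypothetical, tags are passed in as ready-made strings, and N-terminal or C-terminal modifications are not handled.

```python
from typing import Dict, Optional, Tuple

Peptide = Tuple[str, Dict[int, str]]  # (sequence, 1-based position -> tag text)


def peptide_to_proforma(sequence: str, tags: Dict[int, str]) -> str:
    """Write each residue, followed by its bracketed tag if it has one."""
    parts = []
    for position, residue in enumerate(sequence, start=1):
        parts.append(residue)
        if position in tags:
            parts.append(f"[{tags[position]}]")
    return "".join(parts)


def pair_to_proforma(
    alpha: Peptide, beta: Peptide, charge: Optional[int] = None
) -> str:
    """Join both peptides with '//' and append '/charge' if it is known."""
    joined = "//".join(peptide_to_proforma(seq, tags) for seq, tags in (alpha, beta))
    return joined if charge is None else f"{joined}/{charge}"


proforma = pair_to_proforma(
    ("KPMEPTIDE", {1: "Xlink:DSSO", 3: "+15.994915"}),
    ("PEPKTIDE", {4: "Xlink:DSSO"}),
    charge=3,
)
print(proforma)
```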

pyXLMS.transform.util module

pyXLMS.transform.util.assert_data_type_same(data_list: List[Dict[str, Any]]) -> bool

Checks that all data is of the same data type.

Verifies that all elements in the provided list are of the same data type.

Parameters:

data_list (list of dict of str, any) – A list of dictionaries with the data_type key.

Returns:

True if all elements are of the same data type, False otherwise.

Return type:

bool

Examples

>>> from pyXLMS.transform import assert_data_type_same
>>> from pyXLMS import data
>>> data_list = [data.create_crosslink_min("PEPK", 4, "PKEP", 2), data.create_crosslink_min("KPEP", 1, "PEKP", 3)]
>>> assert_data_type_same(data_list)
True
>>> from pyXLMS.transform import assert_data_type_same
>>> from pyXLMS import data
>>> data_list = [data.create_crosslink_min("PEPK", 4, "PKEP", 2), data.create_csm_min("KPEP", 1, "PEKP", 3, "RUN_1", 1)]
>>> assert_data_type_same(data_list)
False
pyXLMS.transform.util.get_available_keys(
data_list: List[Dict[str, Any]],
) -> Dict[str, bool]

Checks which data is available from a list of crosslinks or crosslink-spectrum-matches.

Verifies which data fields have been set for all crosslinks or crosslink-spectrum-matches in the given list. Returns a dictionary structured the same as a crosslink or crosslink-spectrum-match, but with each value replaced by True or False, depending on whether the field was set.

Parameters:

data_list (list of dict of str, any) – A list of crosslinks or crosslink-spectrum-matches.

Returns:

  • If a list of crosslinks was provided, a dictionary with the following keys will be returned, where the value of each key denotes if the data field is available for all crosslinks in data_list. Keys: data_type, completeness, alpha_peptide, alpha_peptide_crosslink_position, alpha_proteins, alpha_proteins_crosslink_positions, alpha_decoy, beta_peptide, beta_peptide_crosslink_position, beta_proteins, beta_proteins_crosslink_positions, beta_decoy, crosslink_type, score, and additional_information.

  • If a list of crosslink-spectrum-matches was provided, a dictionary with the following keys will be returned, where the value of each key denotes if the data field is available for all crosslink-spectrum-matches in data_list. Keys: data_type, completeness, alpha_peptide, alpha_modifications, alpha_peptide_crosslink_position, alpha_proteins, alpha_proteins_crosslink_positions, alpha_proteins_peptide_positions, alpha_score, alpha_decoy, beta_peptide, beta_modifications, beta_peptide_crosslink_position, beta_proteins, beta_proteins_crosslink_positions, beta_proteins_peptide_positions, beta_score, beta_decoy, crosslink_type, score, spectrum_file, scan_nr, retention_time, ion_mobility, and additional_information.

Return type:

dict of str, bool

Raises:
  • TypeError – If not all elements in data_list are of the same data type.

  • TypeError – If one or more elements in the list are of an unsupported data type.

Examples

>>> from pyXLMS.transform import get_available_keys
>>> from pyXLMS import data
>>> data_list = [data.create_crosslink_min("PEPK", 4, "PKEP", 2), data.create_crosslink_min("KPEP", 1, "PEKP", 3)]
>>> available_keys = get_available_keys(data_list)
>>> available_keys["alpha_peptide"]
True
>>> available_keys["score"]
False
pyXLMS.transform.util.modifications_to_str(
modifications: Dict[int, Tuple[str, float]] | None,
) -> str | None

Returns the string representation of a modifications dictionary.

Parameters:

modifications (dict of int, tuple, or None) – The modifications of a peptide given as a dictionary that maps peptide position (1-based) to modification given as a tuple of modification name and modification delta mass. N-terminal modifications should be denoted with position 0. C-terminal modifications should be denoted with position len(peptide) + 1.

Returns:

The string representation of the modifications (or None if no modification was provided).

Return type:

str, or None

Examples

>>> from pyXLMS.transform import modifications_to_str
>>> modifications_to_str({1: ("Oxidation", 15.994915), 5: ("Carbamidomethyl", 57.021464)})
'(1:[Oxidation|15.994915]);(5:[Carbamidomethyl|57.021464])'
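
The output format shown in the example can be rebuilt with a short stand-alone function. This is a sketch and not the library's code: modifications_to_str_sketch is a hypothetical name, and sorting by position as well as returning None for an empty dictionary are assumptions.

```python
from typing import Dict, Optional, Tuple


def modifications_to_str_sketch(
    modifications: Optional[Dict[int, Tuple[str, float]]],
) -> Optional[str]:
    """Format modifications as '(pos:[name|mass]);(pos:[name|mass])'."""
    if not modifications:
        return None  # assumption: None and {} both mean "no modification"
    return ";".join(
        f"({position}:[{name}|{mass}])"
        for position, (name, mass) in sorted(modifications.items())
    )


mods = {1: ("Oxidation", 15.994915), 5: ("Carbamidomethyl", 57.021464)}
mods_str = modifications_to_str_sketch(mods)
print(mods_str)
```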

pyXLMS.transform.validate module

pyXLMS.transform.validate.validate(
data: List[Dict[str, Any]] | Dict[str, Any],
fdr: float = 0.01,
formula: Literal['D/T', '(TD+DD)/TT', '(TD-DD)/TT'] = 'D/T',
score: Literal['higher_better', 'lower_better'] = 'higher_better',
separate_intra_inter: bool = False,
ignore_missing_labels: bool = False,
) -> List[Dict[str, Any]] | Dict[str, Any]

Validate a list of crosslinks or crosslink-spectrum-matches, or a parser_result by estimating false discovery rate.

Validate a list of crosslinks or crosslink-spectrum-matches, or a parser_result by estimating false discovery rate (FDR) using the defined formula. Requires that “score”, “alpha_decoy” and “beta_decoy” fields are set for crosslinks and crosslink-spectrum-matches.

Parameters:
  • data (list of dict of str, any, or dict of str, any) – A list of crosslink-spectrum-matches or crosslinks to validate, or a parser_result.

  • fdr (float, default = 0.01) – The target FDR, must be given as a real number between 0 and 1. The default of 0.01 corresponds to 1% FDR.

  • formula (str, one of "D/T", "(TD+DD)/TT", or "(TD-DD)/TT", default = "D/T") – Which formula to use to estimate FDR. D and DD denote decoy matches, T and TT denote target matches, and TD denotes target-decoy and decoy-target matches.

  • score (str, one of "higher_better" or "lower_better", default = "higher_better") – If a higher score is considered better, or a lower score is considered better.

  • separate_intra_inter (bool, default = False) – If FDR should be estimated separately for intra and inter matches.

  • ignore_missing_labels (bool, default = False) – If crosslinks and crosslink-spectrum-matches should be ignored if they don’t have target and decoy labels. By default an error is raised if any unlabelled data is encountered.

Returns:

If a list of crosslink-spectrum-matches or crosslinks was provided, a list of validated crosslink-spectrum-matches or crosslinks is returned. If a parser_result was provided, a parser_result with validated crosslink-spectrum-matches and/or validated crosslinks will be returned.

Return type:

list of dict of str, any, or dict of str, any

Raises:
  • TypeError – If a wrong data type is provided.

  • TypeError – If parameter formula is not one of ‘D/T’, ‘(TD+DD)/TT’, or ‘(TD-DD)/TT’.

  • TypeError – If parameter score is not one of ‘higher_better’ or ‘lower_better’.

  • ValueError – If parameter fdr is outside of the supported range.

  • ValueError – If attribute ‘score’ is not available for any of the data.

  • ValueError – If attribute ‘alpha_decoy’ or ‘beta_decoy’ is not available for any of the data and parameter ignore_missing_labels is set to False.

  • ValueError – If the number of DD matches exceeds the number of TD matches for formula ‘(TD-DD)/TT’. FDR cannot be estimated with the formula ‘(TD-DD)/TT’ in these cases.

Notes

Please note that progress bars will usually not complete when running this function. This is by design as it is not necessary to iterate over all scores to estimate FDR.

Examples

>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import validate
>>> pr = read("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_CSMs.xlsx", engine="MS Annika", crosslinker="DSS")
>>> csms = pr["crosslink-spectrum-matches"]
>>> len(csms)
826
>>> validated = validate(csms)
>>> len(validated)
705

>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import validate
>>> pr = read(["data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_CSMs.xlsx", "data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_Crosslinks.xlsx"], engine="MS Annika", crosslinker="DSS")
>>> len(pr["crosslink-spectrum-matches"])
826
>>> len(pr["crosslinks"])
300
>>> validated = validate(pr)
>>> len(validated["crosslink-spectrum-matches"])
705
>>> len(validated["crosslinks"])
226

>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import validate
>>> pr = read(["data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_CSMs.xlsx", "data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_Crosslinks.xlsx"], engine="MS Annika", crosslinker="DSS")
>>> len(pr["crosslink-spectrum-matches"])
826
>>> len(pr["crosslinks"])
300
>>> validated = validate(pr, fdr=0.05)
>>> len(validated["crosslink-spectrum-matches"])
825
>>> len(validated["crosslinks"])
260

Module contents#

pyXLMS.transform.aggregate(
csms: List[Dict[str, Any]],
by: Literal['peptide', 'protein'] = 'peptide',
score: Literal['higher_better', 'lower_better'] = 'higher_better',
) List[Dict[str, Any]][source]#

Aggregate crosslink-spectrum-matches to crosslinks.

Aggregates a list of crosslink-spectrum-matches to unique crosslinks. If by = "peptide", a crosslink is considered unique if there is no other crosslink with the same peptide sequence and crosslink position; otherwise it is considered unique if there is no other crosslink with the same protein crosslink position (residue pair). If more than one crosslink exists per peptide sequence/residue pair, the one with the best score is kept and the rest are filtered out. If crosslink-spectrum-matches without scores are provided, the crosslink of the first corresponding crosslink-spectrum-match in the list is kept instead.

Parameters:
  • csms (list of dict of str, any) – A list of crosslink-spectrum-matches.

  • by (str, one of "peptide" or "protein", default = "peptide") – If peptide or protein crosslink position should be used for determining if a crosslink is unique. If protein crosslink position is not available for all crosslink-spectrum-matches a ValueError will be raised. Make sure that all crosslink-spectrum-matches have the _proteins and _proteins_crosslink_positions fields set. If this is not already done by the parser, this can be achieved with transform.reannotate_positions().

  • score (str, one of "higher_better" or "lower_better", default = "higher_better") – If a higher score is considered better, or a lower score is considered better.

Returns:

A list of aggregated, unique crosslinks.

Return type:

list of dict of str, any

Warning

Aggregation will not conserve false discovery rate (FDR)! Aggregating crosslink-spectrum-matches that are validated for 1% FDR will not result in crosslinks validated for 1% FDR! Aggregated crosslinks should be validated with either external tools or with the built-in transform.validate()!

Raises:
  • TypeError – If a wrong data type is provided.

  • TypeError – If parameter by is not one of ‘peptide’ or ‘protein’.

  • TypeError – If parameter score is not one of ‘higher_better’ or ‘lower_better’.

  • ValueError – If parameter by is set to ‘protein’ but protein crosslink positions are not available.

Examples

>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import aggregate
>>> pr = read("data/_test/aggregate/csms.txt", engine="custom", crosslinker="DSS")
>>> len(pr["crosslink-spectrum-matches"])
10
>>> aggregate_peptide = aggregate(pr["crosslink-spectrum-matches"], by="peptide")
>>> len(aggregate_peptide)
3

>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import aggregate
>>> pr = read("data/_test/aggregate/csms.txt", engine="custom", crosslinker="DSS")
>>> len(pr["crosslink-spectrum-matches"])
10
>>> aggregate_protein = aggregate(pr["crosslink-spectrum-matches"], by="protein")
>>> len(aggregate_protein)
2
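
The selection rule described for aggregate() can be illustrated without pyXLMS. The following is a minimal sketch, assuming crosslink-spectrum-matches are plain dictionaries with the keys alpha_peptide, alpha_peptide_crosslink_position, beta_peptide, beta_peptide_crosslink_position and score. The names aggregate_by_peptide_sketch and _csm are hypothetical, and the sketch returns the best-scoring input entry per group instead of a converted crosslink, so it should not be read as the library's implementation.

```python
from typing import Any, Dict, List, Tuple


def aggregate_by_peptide_sketch(
    csms: List[Dict[str, Any]], higher_better: bool = True
) -> List[Dict[str, Any]]:
    """Keep one entry per peptide pair and crosslink position (hypothetical helper)."""
    best: Dict[Tuple[Any, ...], Dict[str, Any]] = {}
    for csm in csms:
        key = (
            csm["alpha_peptide"],
            csm["alpha_peptide_crosslink_position"],
            csm["beta_peptide"],
            csm["beta_peptide_crosslink_position"],
        )
        kept = best.get(key)
        if kept is None:
            # The first match seen for a key is kept by default.
            best[key] = csm
            continue
        new_score, old_score = csm.get("score"), kept.get("score")
        if new_score is None or old_score is None:
            # Without scores the first match stays in place.
            continue
        is_better = new_score > old_score if higher_better else new_score < old_score
        if is_better:
            best[key] = csm
    return list(best.values())


def _csm(alpha: str, pos_a: int, beta: str, pos_b: int, score: float) -> Dict[str, Any]:
    """Build a toy match dictionary for the demonstration below."""
    return {
        "alpha_peptide": alpha,
        "alpha_peptide_crosslink_position": pos_a,
        "beta_peptide": beta,
        "beta_peptide_crosslink_position": pos_b,
        "score": score,
    }


demo = [
    _csm("KPEP", 1, "PEKP", 3, 10.0),
    _csm("KPEP", 1, "PEKP", 3, 25.0),
    _csm("PEPK", 4, "PKEP", 2, 5.0),
]
aggregated = aggregate_by_peptide_sketch(demo)
print(len(aggregated), aggregated[0]["score"])
```

The three toy matches collapse to two groups, and the group that occurs twice is represented by its higher-scoring member. Passing higher_better=False keeps the lower-scoring member instead.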

pyXLMS.transform.assert_data_type_same(data_list: List[Dict[str, Any]]) bool[source]#

Checks that all data is of the same data type.

Verifies that all elements in the provided list are of the same data type.

Parameters:

data_list (list of dict of str, any) – A list of dictionaries with the data_type key.

Returns:

If all elements are of the same data type.

Return type:

bool

Examples

>>> from pyXLMS.transform import assert_data_type_same
>>> from pyXLMS import data
>>> data_list = [data.create_crosslink_min("PEPK", 4, "PKEP", 2), data.create_crosslink_min("KPEP", 1, "PEKP", 3)]
>>> assert_data_type_same(data_list)
True

>>> from pyXLMS.transform import assert_data_type_same
>>> from pyXLMS import data
>>> data_list = [data.create_crosslink_min("PEPK", 4, "PKEP", 2), data.create_csm_min("KPEP", 1, "PEKP", 3, "RUN_1", 1)]
>>> assert_data_type_same(data_list)
False

pyXLMS.transform.fasta_title_to_accession(title: str) str[source]#

Parses the protein accession from a UniProt-like title.

Parameters:

title (str) – Fasta title/header.

Returns:

The protein accession parsed from the title. If parsing was unsuccessful the full title is returned.

Return type:

str

Examples

>>> from pyXLMS.transform import fasta_title_to_accession
>>> title = "sp|A0A087X1C5|CP2D7_HUMAN Putative cytochrome P450 2D7 OS=Homo sapiens OX=9606 GN=CYP2D7 PE=5 SV=1"
>>> fasta_title_to_accession(title)
'A0A087X1C5'

>>> from pyXLMS.transform import fasta_title_to_accession
>>> title = "Cas9"
>>> fasta_title_to_accession(title)
'Cas9'
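
The behaviour shown in these examples can be approximated with a short regular expression. The sketch below is an assumption about how such parsing might look, not the code used by pyXLMS: the function name title_to_accession_sketch and the pattern are hypothetical, and only the sp| and tr| prefixes of UniProt-style headers are recognised.

```python
import re

# UniProt-style headers look like "sp|ACCESSION|ENTRY_NAME description".
_UNIPROT_TITLE = re.compile(r"^(?:sp|tr)\|([^|\s]+)\|")


def title_to_accession_sketch(title: str) -> str:
    """Return the accession of a UniProt-like title, or the title itself (hypothetical helper)."""
    match = _UNIPROT_TITLE.match(title)
    # Fall back to the unchanged title when the pattern does not apply.
    return match.group(1) if match else title


uniprot_title = (
    "sp|A0A087X1C5|CP2D7_HUMAN Putative cytochrome P450 2D7 "
    "OS=Homo sapiens OX=9606 GN=CYP2D7 PE=5 SV=1"
)
print(title_to_accession_sketch(uniprot_title))
print(title_to_accession_sketch("Cas9"))
```

A function with this shape (str in, str out) is the kind of callable that the title_to_accession parameter of reannotate_positions() accepts for fasta files with other header conventions.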

pyXLMS.transform.filter_crosslink_type(data: List[Dict[str, Any]])[source]#

Separate crosslinks and crosslink-spectrum-matches by their crosslink type.

Gets all crosslinks or crosslink-spectrum-matches depending on crosslink type. Will separate based on if a crosslink or crosslink-spectrum-match is of type “intra” or “inter” crosslink.

Parameters:

data (list of dict of str, any) – A list of pyXLMS crosslinks or crosslink-spectrum-matches.

Returns:

Returns a dictionary with key Intra which contains all crosslinks or crosslink-spectrum-matches with crosslink type = “intra”, and key Inter which contains all crosslinks or crosslink-spectrum-matches with crosslink type = “inter”.

Return type:

dict of str, list of dict

Raises:

TypeError – If an unsupported data type is provided.

Examples

>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import filter_crosslink_type
>>> result = read("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_CSMs.xlsx", engine="MS Annika", crosslinker="DSS")
>>> crosslink_type_filtered_csms = filter_crosslink_type(result["crosslink-spectrum-matches"])
>>> len(crosslink_type_filtered_csms["Intra"])
803
>>> len(crosslink_type_filtered_csms["Inter"])
23

>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import filter_crosslink_type
>>> result = read("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_Crosslinks.xlsx", engine="MS Annika", crosslinker="DSS")
>>> crosslink_type_filtered_crosslinks = filter_crosslink_type(result["crosslinks"])
>>> len(crosslink_type_filtered_crosslinks["Intra"])
279
>>> len(crosslink_type_filtered_crosslinks["Inter"])
21

pyXLMS.transform.filter_proteins(
data: List[Dict[str, Any]],
proteins: Set[str] | List[str],
) Dict[str, List[Any]][source]#

Get all crosslinks or crosslink-spectrum-matches originating from proteins of interest.

Gets all crosslinks or crosslink-spectrum-matches originating from a list of proteins of interest. Returns a list of crosslinks or crosslink-spectrum-matches where both peptides come from a protein of interest, and a list of crosslinks or crosslink-spectrum-matches where one of the peptides comes from a protein of interest.

Parameters:
  • data (list of dict of str, any) – A list of pyXLMS crosslinks or crosslink-spectrum-matches.

  • proteins (set of str, or list of str) – A set of protein accessions of interest.

Returns:

Returns a dictionary with key Proteins which contains the list of proteins of interest, key Both which contains all crosslinks or crosslink-spectrum-matches where both peptides are originating from a protein of interest, and key One which contains all crosslinks or crosslink-spectrum-matches where one of the two peptides is originating from a protein of interest.

Return type:

dict

Raises:

TypeError – If an unsupported data type is provided.

Examples

>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import filter_proteins
>>> result = read("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_CSMs.xlsx", engine="MS Annika", crosslinker="DSS")
>>> proteins_csms = filter_proteins(result["crosslink-spectrum-matches"], ["Cas9"])
>>> proteins_csms["Proteins"]
['Cas9']
>>> len(proteins_csms["Both"])
798
>>> len(proteins_csms["One"])
23

>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import filter_proteins
>>> result = read("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_Crosslinks.xlsx", engine="MS Annika", crosslinker="DSS")
>>> proteins_xls = filter_proteins(result["crosslinks"], ["Cas9"])
>>> proteins_xls["Proteins"]
['Cas9']
>>> len(proteins_xls["Both"])
274
>>> len(proteins_xls["One"])
21
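
The split into Both and One can be sketched on plain dictionaries. This is an illustration under stated assumptions: items carry alpha_proteins and beta_proteins lists, the name split_by_proteins_sketch is hypothetical, and One is interpreted as "exactly one peptide matches", which is consistent with the counts in the examples but is not confirmed by the description.

```python
from typing import Any, Dict, Iterable, List


def split_by_proteins_sketch(
    data: List[Dict[str, Any]], proteins: Iterable[str]
) -> Dict[str, List[Any]]:
    """Sort items by how many of their peptides map to a protein of interest."""
    protein_list = list(proteins)
    wanted = set(protein_list)
    result: Dict[str, List[Any]] = {"Proteins": protein_list, "Both": [], "One": []}
    for item in data:
        alpha_hit = bool(wanted & set(item.get("alpha_proteins") or []))
        beta_hit = bool(wanted & set(item.get("beta_proteins") or []))
        if alpha_hit and beta_hit:
            result["Both"].append(item)
        elif alpha_hit or beta_hit:
            # Assumption: "One" means exactly one of the two peptides matches.
            result["One"].append(item)
    return result


toy_links = [
    {"alpha_proteins": ["Cas9"], "beta_proteins": ["Cas9"]},
    {"alpha_proteins": ["Cas9"], "beta_proteins": ["OTHER"]},
    {"alpha_proteins": ["OTHER"], "beta_proteins": ["OTHER"]},
]
split = split_by_proteins_sketch(toy_links, ["Cas9"])
print(len(split["Both"]), len(split["One"]))
```

Items where neither peptide maps to a protein of interest end up in neither list, which is why Both and One do not have to add up to the input length.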

pyXLMS.transform.filter_target_decoy(
data: List[Dict[str, Any]],
) Dict[str, List[Dict[str, Any]]][source]#

Separate crosslinks or crosslink-spectrum-matches based on target and decoy matches.

Separates crosslinks or crosslink-spectrum-matches based on whether both peptides match to the target database, both match to the decoy database, or one of them matches to the target database and the other to the decoy database. The first we denote as “Target-Target” or “TT” matches, the second as “Decoy-Decoy” or “DD” matches, and the third as “Target-Decoy” or “TD” matches.

Parameters:

data (list of dict of str, any) – A list of pyXLMS crosslinks or crosslink-spectrum-matches.

Returns:

Returns a dictionary with key Target-Target which contains all TT matches, key Target-Decoy which contains all TD matches, and key Decoy-Decoy which contains all DD matches.

Return type:

dict

Raises:

TypeError – If an unsupported data type is provided.

Examples

>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import filter_target_decoy
>>> result = read("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_CSMs.xlsx", engine="MS Annika", crosslinker="DSS")
>>> target_and_decoys = filter_target_decoy(result["crosslink-spectrum-matches"])
>>> len(target_and_decoys["Target-Target"])
786
>>> len(target_and_decoys["Target-Decoy"])
39
>>> len(target_and_decoys["Decoy-Decoy"])
1

>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import filter_target_decoy
>>> result = read("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_Crosslinks.xlsx", engine="MS Annika", crosslinker="DSS")
>>> target_and_decoys = filter_target_decoy(result["crosslinks"])
>>> len(target_and_decoys["Target-Target"])
265
>>> len(target_and_decoys["Target-Decoy"])
0
>>> len(target_and_decoys["Decoy-Decoy"])
35
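
The three-way split follows directly from the two decoy flags of a match. The sketch below assumes that each item is a dictionary with boolean alpha_decoy and beta_decoy fields (both names appear in the key lists of get_available_keys()); the function name split_target_decoy_sketch is hypothetical and missing labels are not handled.

```python
from typing import Any, Dict, List


def split_target_decoy_sketch(
    data: List[Dict[str, Any]],
) -> Dict[str, List[Dict[str, Any]]]:
    """Group items by the number of peptides that matched the decoy database."""
    labels = ("Target-Target", "Target-Decoy", "Decoy-Decoy")
    result: Dict[str, List[Dict[str, Any]]] = {label: [] for label in labels}
    for item in data:
        # 0 decoy peptides -> TT, 1 -> TD (in either order), 2 -> DD.
        n_decoys = int(bool(item["alpha_decoy"])) + int(bool(item["beta_decoy"]))
        result[labels[n_decoys]].append(item)
    return result


toy_matches = [
    {"alpha_decoy": False, "beta_decoy": False},
    {"alpha_decoy": True, "beta_decoy": False},
    {"alpha_decoy": False, "beta_decoy": True},
    {"alpha_decoy": True, "beta_decoy": True},
]
groups = split_target_decoy_sketch(toy_matches)
print({label: len(items) for label, items in groups.items()})
```

Counting decoy peptides instead of comparing the flags pairwise treats target-decoy and decoy-target matches as the same group, as in the description above.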

pyXLMS.transform.get_available_keys(
data_list: List[Dict[str, Any]],
) Dict[str, bool][source]#

Checks which data is available from a list of crosslinks or crosslink-spectrum-matches.

Verifies which data fields have been set for all crosslinks or crosslink-spectrum-matches in the given list. Returns a dictionary structured the same as a crosslink or crosslink-spectrum-match, but instead of the data each key maps to True or False, depending on whether the field was set.

Parameters:

data_list (list of dict of str, any) – A list of crosslinks or crosslink-spectrum-matches.

Returns:

  • If a list of crosslinks was provided, a dictionary with the following keys will be returned, where the value of each key denotes if the data field is available for all crosslinks in data_list. Keys: data_type, completeness, alpha_peptide, alpha_peptide_crosslink_position, alpha_proteins, alpha_proteins_crosslink_positions, alpha_decoy, beta_peptide, beta_peptide_crosslink_position, beta_proteins, beta_proteins_crosslink_positions, beta_decoy, crosslink_type, score, and additional_information.

  • If a list of crosslink-spectrum-matches was provided, a dictionary with the following keys will be returned, where the value of each key denotes if the data field is available for all crosslink-spectrum-matches in data_list. Keys: data_type, completeness, alpha_peptide, alpha_modifications, alpha_peptide_crosslink_position, alpha_proteins, alpha_proteins_crosslink_positions, alpha_proteins_peptide_positions, alpha_score, alpha_decoy, beta_peptide, beta_modifications, beta_peptide_crosslink_position, beta_proteins, beta_proteins_crosslink_positions, beta_proteins_peptide_positions, beta_score, beta_decoy, crosslink_type, score, spectrum_file, scan_nr, retention_time, ion_mobility, and additional_information.

Return type:

dict of str, bool

Raises:
  • TypeError – If not all elements in data_list are of the same data type.

  • TypeError – If one or more elements in the list are of an unsupported data type.

Examples

>>> from pyXLMS.transform import get_available_keys
>>> from pyXLMS import data
>>> data_list = [data.create_crosslink_min("PEPK", 4, "PKEP", 2), data.create_crosslink_min("KPEP", 1, "PEKP", 3)]
>>> available_keys = get_available_keys(data_list)
>>> available_keys["alpha_peptide"]
True
>>> available_keys["score"]
False
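
The idea of "available for all elements" can be sketched as follows. The sketch assumes that an unset field is stored as None and that every element has the same keys as the first one; both are assumptions made for illustration, and available_keys_sketch is a hypothetical name.

```python
from typing import Any, Dict, List


def available_keys_sketch(data_list: List[Dict[str, Any]]) -> Dict[str, bool]:
    """Report for each key whether it is set (not None) in every element."""
    if not data_list:
        return {}
    data_types = {item.get("data_type") for item in data_list}
    if len(data_types) != 1:
        raise TypeError("Not all elements are of the same data type!")
    return {
        key: all(item.get(key) is not None for item in data_list)
        for key in data_list[0]
    }


toy_crosslinks = [
    {"data_type": "crosslink", "alpha_peptide": "PEPK", "score": None},
    {"data_type": "crosslink", "alpha_peptide": "KPEP", "score": 12.5},
]
availability = available_keys_sketch(toy_crosslinks)
print(availability)
```

A single element with a missing score is enough to report score as unavailable, which matches the "for all" wording of the description.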

pyXLMS.transform.modifications_to_str(
modifications: Dict[int, Tuple[str, float]] | None,
) str | None[source]#

Returns the string representation of a modifications dictionary.

Parameters:

modifications (dict of [int, tuple], or None) – The modifications of a peptide given as a dictionary that maps peptide position (1-based) to modification given as a tuple of modification name and modification delta mass. N-terminal modifications should be denoted with position 0. C-terminal modifications should be denoted with position len(peptide) + 1.

Returns:

The string representation of the modifications (or None if no modification was provided).

Return type:

str, or None

Examples

>>> from pyXLMS.transform import modifications_to_str
>>> modifications_to_str({1: ("Oxidation", 15.994915), 5: ("Carbamidomethyl", 57.021464)})
'(1:[Oxidation|15.994915]);(5:[Carbamidomethyl|57.021464])'
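
The string format in the example can be reproduced with a few lines of plain Python. This is a sketch derived only from the example output above: the name modifications_to_str_sketch is hypothetical, and sorting the entries by position is an assumption.

```python
from typing import Dict, Optional, Tuple


def modifications_to_str_sketch(
    modifications: Optional[Dict[int, Tuple[str, float]]],
) -> Optional[str]:
    """Serialise {position: (name, delta mass)} as "(pos:[name|mass]);..."."""
    if modifications is None:
        return None
    entries = sorted(modifications.items(), key=lambda entry: entry[0])
    return ";".join(f"({position}:[{name}|{mass}])" for position, (name, mass) in entries)


modification_string = modifications_to_str_sketch(
    {5: ("Carbamidomethyl", 57.021464), 1: ("Oxidation", 15.994915)}
)
print(modification_string)
```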

pyXLMS.transform.reannotate_positions(
data: List[Dict[str, Any]] | Dict[str, Any],
fasta: str | BinaryIO,
title_to_accession: Callable[[str], str] | None = None,
) List[Dict[str, Any]] | Dict[str, Any][source]#

Reannotates protein crosslink positions for a given fasta file.

Reannotates the protein crosslink positions and peptide positions of the given cross-linked peptide pairs based on the specified fasta file. Takes a list of crosslink-spectrum-matches or crosslinks, or a parser_result as input.

Parameters:
  • data (list of dict of str, any, or dict of str, any) – A list of crosslink-spectrum-matches or crosslinks to annotate, or a parser_result.

  • fasta (str, or file stream) – The name/path of the fasta file containing protein sequences or a file-like object/stream.

  • title_to_accession (callable, or None, default = None) – A function that parses the protein accession from the fasta title/header. If None (default) the function fasta_title_to_accession is used.

Returns:

If a list of crosslink-spectrum-matches or crosslinks was provided, a list of annotated crosslink-spectrum-matches or crosslinks is returned. If a parser_result was provided, an annotated parser_result will be returned.

Return type:

list of dict of str, any, or dict of str, any

Raises:

TypeError – If a wrong data type is provided.

Examples

>>> from pyXLMS.data import create_crosslink_min
>>> from pyXLMS.transform import reannotate_positions
>>> xls = [create_crosslink_min("ADANLDK", 7, "GNTDRHSIK", 9)]
>>> xls = reannotate_positions(xls, "data/_fasta/Cas9_plus10.fasta")
>>> xls[0]["alpha_proteins"]
['Cas9']
>>> xls[0]["alpha_proteins_crosslink_positions"]
[1293]
>>> xls[0]["beta_proteins"]
['Cas9']
>>> xls[0]["beta_proteins_crosslink_positions"]
[48]
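
The arithmetic behind a protein crosslink position is a simple offset calculation. The sketch below works on an in-memory dictionary of sequences rather than a fasta file and uses exact substring matching; the name locate_crosslink_sketch and the toy sequences are hypothetical, and details such as how the library treats ambiguous residues are not covered.

```python
from typing import Dict, List, Tuple


def locate_crosslink_sketch(
    peptide: str, crosslink_position: int, proteins: Dict[str, str]
) -> Tuple[List[str], List[int]]:
    """Return accessions and 1-based protein positions of the crosslinked residue."""
    accessions: List[str] = []
    positions: List[int] = []
    for accession, sequence in proteins.items():
        start = sequence.find(peptide)
        while start != -1:
            accessions.append(accession)
            # start is 0-based, crosslink_position is 1-based within the peptide,
            # so their sum is the 1-based position within the protein.
            positions.append(start + crosslink_position)
            start = sequence.find(peptide, start + 1)
    return accessions, positions


toy_proteins = {"P1": "MADANLDKVL", "P2": "KPEPKPEP"}
single = locate_crosslink_sketch("ADANLDK", 7, toy_proteins)
repeated = locate_crosslink_sketch("KPEP", 1, toy_proteins)
print(single, repeated)
```

A peptide that occurs more than once yields one entry per occurrence, which is why the annotated fields in the example above are lists.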

pyXLMS.transform.summary(
data: List[Dict[str, Any]] | Dict[str, Any],
) Dict[str, float][source]#

Extracts summary stats from a list of crosslinks or crosslink-spectrum-matches, or a parser_result.

Extracts summary statistics from a list of crosslinks or crosslink-spectrum-matches, or a parser_result. The statistics depend on the supplied data type. If a list of crosslinks is supplied, a dictionary with the following statistics and keys is returned:

  • Number of crosslinks

  • Number of unique crosslinks by peptide

  • Number of unique crosslinks by protein

  • Number of intra crosslinks

  • Number of inter crosslinks

  • Number of target-target crosslinks

  • Number of target-decoy crosslinks

  • Number of decoy-decoy crosslinks

  • Minimum crosslink score

  • Maximum crosslink score

If a list of crosslink-spectrum-matches is supplied, a dictionary with the following statistics and keys is returned:

  • Number of CSMs

  • Number of unique CSMs

  • Number of intra CSMs

  • Number of inter CSMs

  • Number of target-target CSMs

  • Number of target-decoy CSMs

  • Number of decoy-decoy CSMs

  • Minimum CSM score

  • Maximum CSM score

If a parser_result is supplied, the statistics for all available data are returned. A parser_result that only contains crosslinks yields a dictionary with crosslink statistics, and a parser_result that only contains crosslink-spectrum-matches yields a dictionary with crosslink-spectrum-match statistics. If the parser_result contains both, the two dictionaries are merged. Please note that in this case a single dictionary is returned that contains both the keys for crosslinks and the keys for crosslink-spectrum-matches.

Statistics are also printed to stdout.

Parameters:

data (list of dict of str, any, or dict of str, any) – A list of crosslinks or crosslink-spectrum-matches, or a parser_result.

Returns:

A dictionary with summary statistics.

Return type:

dict of str, float

Examples

>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import summary
>>> pr = read("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_CSMs.xlsx", engine="MS Annika", crosslinker="DSS")
>>> csms = pr["crosslink-spectrum-matches"]
>>> stats = summary(csms)
Number of CSMs: 826.0
Number of unique CSMs: 826.0
Number of intra CSMs: 803.0
Number of inter CSMs: 23.0
Number of target-target CSMs: 786.0
Number of target-decoy CSMs: 39.0
Number of decoy-decoy CSMs: 1.0
Minimum CSM score: 1.11
Maximum CSM score: 452.99

>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import summary
>>> pr = read("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_Crosslinks.xlsx", engine="MS Annika", crosslinker="DSS")
>>> stats = summary(pr)
Number of crosslinks: 300.0
Number of unique crosslinks by peptide: 300.0
Number of unique crosslinks by protein: 298.0
Number of intra crosslinks: 279.0
Number of inter crosslinks: 21.0
Number of target-target crosslinks: 265.0
Number of target-decoy crosslinks: 0.0
Number of decoy-decoy crosslinks: 35.0
Minimum crosslink score: 1.11
Maximum crosslink score: 452.99

pyXLMS.transform.targets_only(
data: List[Dict[str, Any]] | Dict[str, Any],
) List[Dict[str, Any]] | Dict[str, Any][source]#

Get target crosslinks or crosslink-spectrum-matches.

Get target crosslinks or crosslink-spectrum-matches from a list of target and decoy crosslinks or crosslink-spectrum-matches, or a parser_result. This effectively filters out any target-decoy and decoy-decoy matches and is essentially a convenience wrapper for transform.filter_target_decoy()["Target-Target"].

Parameters:

data (dict of str, any, or list of dict of str, any) – A list of crosslink-spectrum-matches or crosslinks, or a parser_result.

Returns:

If a list of crosslink-spectrum-matches or crosslinks was provided, a list of target crosslink-spectrum-matches or crosslinks is returned. If a parser_result was provided, a parser_result with target crosslink-spectrum-matches and/or target crosslinks will be returned.

Return type:

list of dict of str, any, or dict of str, any

Raises:
  • TypeError – If a wrong data type is provided.

  • RuntimeError – If no target crosslinks or crosslink-spectrum-matches were found.

Examples

>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import targets_only
>>> result = read("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_CSMs.xlsx", engine="MS Annika", crosslinker="DSS")
>>> targets = targets_only(result["crosslink-spectrum-matches"])
>>> len(targets)
786

>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import targets_only
>>> result = read("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_Crosslinks.xlsx", engine="MS Annika", crosslinker="DSS")
>>> targets = targets_only(result["crosslinks"])
>>> len(targets)
265

>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import targets_only
>>> result = read(["data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_CSMs.xlsx", "data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_Crosslinks.xlsx"], engine="MS Annika", crosslinker="DSS")
>>> result_targets = targets_only(result)
>>> len(result_targets["crosslink-spectrum-matches"])
786
>>> len(result_targets["crosslinks"])
265

pyXLMS.transform.to_dataframe(
data: List[Dict[str, Any]],
) DataFrame[source]#

Returns a pandas DataFrame of the given crosslinks or crosslink-spectrum-matches.

Parameters:

data (list) – A list of crosslinks or crosslink-spectrum-matches as created by data.create_crosslink() or data.create_csm().

Returns:

The pandas DataFrame created from the list of input crosslinks or crosslink-spectrum-matches. A full specification of the returned DataFrame can be found in the docs.

Return type:

pandas.DataFrame

Raises:
  • TypeError – If the list does not contain crosslinks or crosslink-spectrum-matches.

  • ValueError – If the list does not contain any objects.

Examples

>>> from pyXLMS.transform import to_dataframe
>>> # assume that crosslinks is a list of crosslinks created by data.create_crosslink()
>>> crosslink_dataframe = to_dataframe(crosslinks)
>>> # assume csms is a list of crosslink-spectrum-matches created by data.create_csm()
>>> csm_dataframe = to_dataframe(csms)

pyXLMS.transform.to_proforma(
data: Dict[str, Any] | List[Dict[str, Any]],
crosslinker: str | float | None = None,
) str | List[str][source]#

Returns the Proforma string for a single crosslink or crosslink-spectrum-match, or for a list of crosslinks or crosslink-spectrum-matches.

Parameters:
  • data (dict of str, any, or list of dict of str, any) – A pyXLMS crosslink object, e.g. see data.create_crosslink(). Or a pyXLMS crosslink-spectrum-match object, e.g. see data.create_csm(). Alternatively, a list of crosslinks or crosslink-spectrum-matches can also be provided.

  • crosslinker (str, or float, or None, default = None) – Optional name or mass of the crosslink reagent. If the name is given, it should be a valid name from XLMOD. If the crosslink modification is contained in the crosslink-spectrum-match object this parameter has no effect.

Returns:

The Proforma string of the crosslink or crosslink-spectrum-match. If a list was provided a list containing all Proforma strings is returned.

Return type:

str, or list of str

Raises:

TypeError – If an unsupported data type is provided.

Notes

  • Modifications with unknown mass are skipped.

  • If no modifications are given, only the crosslink modification will be encoded in the Proforma.

  • If no modifications are given and no crosslinker is given, the unmodified peptide Proforma will be returned.

Examples

>>> from pyXLMS.data import create_crosslink_min
>>> from pyXLMS.transform import to_proforma
>>> xl = create_crosslink_min("PEPKTIDE", 4, "KPEPTIDE", 1)
>>> to_proforma(xl)
'KPEPTIDE//PEPKTIDE'

>>> from pyXLMS.data import create_crosslink_min
>>> from pyXLMS.transform import to_proforma
>>> xl = create_crosslink_min("PEPKTIDE", 4, "KPEPTIDE", 1)
>>> to_proforma(xl, crosslinker="Xlink:DSSO")
'K[Xlink:DSSO]PEPTIDE//PEPK[Xlink:DSSO]TIDE'

>>> from pyXLMS.data import create_csm_min
>>> from pyXLMS.transform import to_proforma
>>> csm = create_csm_min("PEPKTIDE", 4, "KPEPTIDE", 1, "RUN_1", 1)
>>> to_proforma(csm)
'KPEPTIDE//PEPKTIDE'

>>> from pyXLMS.data import create_csm_min
>>> from pyXLMS.transform import to_proforma
>>> csm = create_csm_min("PEPKTIDE", 4, "KPEPTIDE", 1, "RUN_1", 1)
>>> to_proforma(csm, crosslinker="Xlink:DSSO")
'K[Xlink:DSSO]PEPTIDE//PEPK[Xlink:DSSO]TIDE'

>>> from pyXLMS.data import create_csm_min
>>> from pyXLMS.transform import to_proforma
>>> csm = create_csm_min("PEPKTIDE", 4, "KPMEPTIDE", 1, "RUN_1", 1, modifications_b={3:("Oxidation", 15.994915)})
>>> to_proforma(csm, crosslinker="Xlink:DSSO")
'K[Xlink:DSSO]PM[+15.994915]EPTIDE//PEPK[Xlink:DSSO]TIDE'

>>> from pyXLMS.data import create_csm_min
>>> from pyXLMS.transform import to_proforma
>>> csm = create_csm_min("PEPKTIDE", 4, "KPMEPTIDE", 1, "RUN_1", 1, modifications_b={3:("Oxidation", 15.994915)}, charge=3)
>>> to_proforma(csm, crosslinker="Xlink:DSSO")
'K[Xlink:DSSO]PM[+15.994915]EPTIDE//PEPK[Xlink:DSSO]TIDE/3'

>>> from pyXLMS.data import create_csm_min
>>> from pyXLMS.transform import to_proforma
>>> csm = create_csm_min("PEPKTIDE", 4, "KPMEPTIDE", 1, "RUN_1", 1, modifications_a={4:("DSSO", 158.00376)}, modifications_b={1:("DSSO", 158.00376), 3:("Oxidation", 15.994915)}, charge=3)
>>> to_proforma(csm)
'K[+158.00376]PM[+15.994915]EPTIDE//PEPK[+158.00376]TIDE/3'

>>> from pyXLMS.data import create_csm_min
>>> from pyXLMS.transform import to_proforma
>>> csm = create_csm_min("PEPKTIDE", 4, "KPMEPTIDE", 1, "RUN_1", 1, modifications_a={4:("DSSO", 158.00376)}, modifications_b={1:("DSSO", 158.00376), 3:("Oxidation", 15.994915)}, charge=3)
>>> to_proforma(csm, crosslinker="Xlink:DSSO")
'K[+158.00376]PM[+15.994915]EPTIDE//PEPK[+158.00376]TIDE/3'
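
The structure of the strings in these examples can be rebuilt with a small helper. The sketch below is an approximation inferred from the example outputs only: annotate_peptide_sketch and proforma_pair_sketch are hypothetical names, modifications are reduced to a position-to-mass mapping, terminal modifications are not handled, and letting a mass shift at the crosslink position take precedence over the crosslinker name is an assumption that mirrors the last two examples.

```python
from typing import Dict, Optional


def annotate_peptide_sketch(
    peptide: str,
    crosslink_position: int,
    crosslinker: Optional[str] = None,
    mass_shifts: Optional[Dict[int, float]] = None,
) -> str:
    """Insert bracketed annotations after the affected residues (1-based positions)."""
    shifts = mass_shifts or {}
    parts = []
    for position, residue in enumerate(peptide, start=1):
        parts.append(residue)
        if position in shifts:
            # Known delta masses are written with an explicit sign.
            parts.append(f"[{shifts[position]:+}]")
        elif crosslinker is not None and position == crosslink_position:
            parts.append(f"[{crosslinker}]")
    return "".join(parts)


def proforma_pair_sketch(first: str, second: str, charge: Optional[int] = None) -> str:
    """Join two annotated peptides with "//" and append an optional charge state."""
    suffix = f"/{charge}" if charge is not None else ""
    return f"{first}//{second}{suffix}"


plain = proforma_pair_sketch(
    annotate_peptide_sketch("KPEPTIDE", 1), annotate_peptide_sketch("PEPKTIDE", 4)
)
labelled = proforma_pair_sketch(
    annotate_peptide_sketch("KPMEPTIDE", 1, "Xlink:DSSO", {3: 15.994915}),
    annotate_peptide_sketch("PEPKTIDE", 4, "Xlink:DSSO"),
    charge=3,
)
print(plain)
print(labelled)
```

The helper takes the peptides in the order in which they should appear in the output; how the library decides which peptide comes first is outside the scope of this sketch.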

pyXLMS.transform.unique(
data: List[Dict[str, Any]] | Dict[str, Any],
by: Literal['peptide', 'protein'] = 'peptide',
score: Literal['higher_better', 'lower_better'] = 'higher_better',
) List[Dict[str, Any]] | Dict[str, Any][source]#

Filter for unique crosslinks or crosslink-spectrum-matches.

Filters for unique crosslinks from a list of non-unique crosslinks. If by = "peptide", a crosslink is considered unique if there is no other crosslink with the same peptide sequence and crosslink position; otherwise it is considered unique if there is no other crosslink with the same protein crosslink position (residue pair). If more than one crosslink exists per peptide sequence/residue pair, the one with the best score is kept and the rest are filtered out. If crosslinks without scores are provided, the first crosslink in the list is kept instead.

or

Filters for unique crosslink-spectrum-matches from a list of non-unique crosslink-spectrum-matches. A crosslink-spectrum-match is considered unique if there is no other crosslink-spectrum-match from the same spectrum file and with the same scan number. If more than one crosslink-spectrum-match exists per spectrum file and scan number, the one with the best score is kept and the rest are filtered out. If crosslink-spectrum-matches without scores are provided, the first crosslink-spectrum-match in the list is kept instead.

Parameters:
  • data (dict of str, any, or list of dict of str, any) – A list of crosslink-spectrum-matches or crosslinks to filter, or a parser_result.

  • by (str, one of "peptide" or "protein", default = "peptide") – If peptide or protein crosslink position should be used for determining if a crosslink is unique. Only affects filtering for unique crosslinks and not crosslink-spectrum-matches. If protein crosslink position is not available for all crosslinks a ValueError will be raised. Make sure that all crosslinks have the _proteins and _proteins_crosslink_positions fields set. If this is not already done by the parser, this can be achieved with transform.reannotate_positions().

  • score (str, one of "higher_better" or "lower_better", default = "higher_better") – If a higher score is considered better, or a lower score is considered better.

Returns:

If a list of crosslink-spectrum-matches or crosslinks was provided, a list of unique crosslink-spectrum-matches or crosslinks is returned. If a parser_result was provided, a parser_result with unique crosslink-spectrum-matches and/or unique crosslinks will be returned.

Return type:

list of dict of str, any, or dict of str, any

Raises:
  • TypeError – If a wrong data type is provided.

  • TypeError – If parameter by is not one of ‘peptide’ or ‘protein’.

  • TypeError – If parameter score is not one of ‘higher_better’ or ‘lower_better’.

  • ValueError – If parameter by is set to ‘protein’ but protein crosslink positions are not available.

Examples

>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import unique
>>> pr = read(["data/_test/aggregate/csms.txt", "data/_test/aggregate/xls.txt"], engine="custom", crosslinker="DSS")
>>> len(pr["crosslink-spectrum-matches"])
10
>>> len(pr["crosslinks"])
10
>>> unique_peptide = unique(pr, by="peptide")
>>> len(unique_peptide["crosslink-spectrum-matches"])
5
>>> len(unique_peptide["crosslinks"])
3

>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import unique
>>> pr = read(["data/_test/aggregate/csms.txt", "data/_test/aggregate/xls.txt"], engine="custom", crosslinker="DSS")
>>> len(pr["crosslink-spectrum-matches"])
10
>>> len(pr["crosslinks"])
10
>>> unique_protein = unique(pr, by="protein")
>>> len(unique_protein["crosslink-spectrum-matches"])
5
>>> len(unique_protein["crosslinks"])
2

pyXLMS.transform.validate(
data: List[Dict[str, Any]] | Dict[str, Any],
fdr: float = 0.01,
formula: Literal['D/T', '(TD+DD)/TT', '(TD-DD)/TT'] = 'D/T',
score: Literal['higher_better', 'lower_better'] = 'higher_better',
separate_intra_inter: bool = False,
ignore_missing_labels: bool = False,
) List[Dict[str, Any]] | Dict[str, Any][source]#

Validate a list of crosslinks or crosslink-spectrum-matches, or a parser_result by estimating false discovery rate.

Validate a list of crosslinks or crosslink-spectrum-matches, or a parser_result by estimating false discovery rate (FDR) using the defined formula. Requires that “score”, “alpha_decoy” and “beta_decoy” fields are set for crosslinks and crosslink-spectrum-matches.

Parameters:
  • data (list of dict of str, any, or dict of str, any) – A list of crosslink-spectrum-matches or crosslinks to validate, or a parser_result.

  • fdr (float, default = 0.01) – The target FDR, must be given as a real number between 0 and 1. The default of 0.01 corresponds to 1% FDR.

  • formula (str, one of "D/T", "(TD+DD)/TT", or "(TD-DD)/TT", default = "D/T") – Which formula to use to estimate FDR. D and DD denote decoy matches, T and TT denote target matches, and TD denotes target-decoy and decoy-target matches.

  • score (str, one of "higher_better" or "lower_better", default = "higher_better") – If a higher score is considered better, or a lower score is considered better.

  • separate_intra_inter (bool, default = False) – If FDR should be estimated separately for intra and inter matches.

  • ignore_missing_labels (bool, default = False) – If crosslinks and crosslink-spectrum-matches without target and decoy labels should be ignored. By default, an error is raised if any unlabelled data is encountered.

Returns:

If a list of crosslink-spectrum-matches or crosslinks was provided, a list of validated crosslink-spectrum-matches or crosslinks is returned. If a parser_result was provided, a parser_result with validated crosslink-spectrum-matches and/or validated crosslinks will be returned.

Return type:

list of dict of str, any, or dict of str, any

Raises:
  • TypeError – If a wrong data type is provided.

  • TypeError – If parameter formula is not one of ‘D/T’, ‘(TD+DD)/TT’, or ‘(TD-DD)/TT’.

  • TypeError – If parameter score is not one of ‘higher_better’ or ‘lower_better’.

  • ValueError – If parameter fdr is outside of the supported range.

  • ValueError – If attribute ‘score’ is not available for any of the data.

  • ValueError – If attribute ‘alpha_decoy’ or ‘beta_decoy’ is not available for any of the data and parameter ignore_missing_labels is set to False.

  • ValueError – If the number of DD matches exceeds the number of TD matches for formula ‘(TD-DD)/TT’. FDR cannot be estimated with the formula ‘(TD-DD)/TT’ in these cases.

Notes

Please note that progress bars will usually not complete when running this function. This is by design as it is not necessary to iterate over all scores to estimate FDR.

Examples

>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import validate
>>> pr = read("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_CSMs.xlsx", engine="MS Annika", crosslinker="DSS")
>>> csms = pr["crosslink-spectrum-matches"]
>>> len(csms)
826
>>> validated = validate(csms)
>>> len(validated)
705

>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import validate
>>> pr = read(["data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_CSMs.xlsx", "data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_Crosslinks.xlsx"], engine="MS Annika", crosslinker="DSS")
>>> len(pr["crosslink-spectrum-matches"])
826
>>> len(pr["crosslinks"])
300
>>> validated = validate(pr)
>>> len(validated["crosslink-spectrum-matches"])
705
>>> len(validated["crosslinks"])
226

>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import validate
>>> pr = read(["data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_CSMs.xlsx", "data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_Crosslinks.xlsx"], engine="MS Annika", crosslinker="DSS")
>>> len(pr["crosslink-spectrum-matches"])
826
>>> len(pr["crosslinks"])
300
>>> validated = validate(pr, fdr=0.05)
>>> len(validated["crosslink-spectrum-matches"])
825
>>> len(validated["crosslinks"])
260
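
To make the idea of score-based FDR validation concrete, the following sketch searches for the most permissive score cutoff at which the estimated FDR drops to the target value. It is an illustration under stated assumptions, not the algorithm used by validate(): the names match_label, estimate_fdr and validate_sketch are hypothetical, only the "(TD+DD)/TT" and "(TD-DD)/TT" formulas are covered, higher scores are assumed to be better, and separate handling of intra and inter matches is omitted. The early return also gives one plausible reading of why a progress bar over all scores would not run to completion.

```python
from typing import Any, Dict, List


def match_label(item: Dict[str, Any]) -> str:
    """Classify a match as TT, TD or DD from its two decoy flags."""
    n_decoys = int(bool(item["alpha_decoy"])) + int(bool(item["beta_decoy"]))
    return ("TT", "TD", "DD")[n_decoys]


def estimate_fdr(tt: int, td: int, dd: int, formula: str = "(TD+DD)/TT") -> float:
    """Estimate FDR from match counts with one of two formulas."""
    if formula == "(TD+DD)/TT":
        decoys = td + dd
    elif formula == "(TD-DD)/TT":
        if dd > td:
            raise ValueError("DD exceeds TD, FDR cannot be estimated with (TD-DD)/TT.")
        decoys = td - dd
    else:
        raise TypeError(f"Unsupported formula: {formula}")
    # Without any target-target match no meaningful estimate exists.
    return decoys / tt if tt > 0 else float("inf")


def validate_sketch(
    data: List[Dict[str, Any]], fdr: float = 0.01, formula: str = "(TD+DD)/TT"
) -> List[Dict[str, Any]]:
    """Return the largest top-scoring subset whose estimated FDR is at most fdr."""
    ranked = sorted(data, key=lambda item: item["score"])  # worst score first
    counts = {"TT": 0, "TD": 0, "DD": 0}
    for item in ranked:
        counts[match_label(item)] += 1
    for index, item in enumerate(ranked):
        if estimate_fdr(counts["TT"], counts["TD"], counts["DD"], formula) <= fdr:
            # Stop early: everything from this rank upwards passes.
            return ranked[index:]
        counts[match_label(item)] -= 1  # drop the current worst match
    return []


targets = [
    {"score": float(s), "alpha_decoy": False, "beta_decoy": False} for s in range(10, 20)
]
decoys = [
    {"score": 1.0, "alpha_decoy": True, "beta_decoy": False},
    {"score": 2.0, "alpha_decoy": False, "beta_decoy": True},
]
validated = validate_sketch(targets + decoys, fdr=0.15)
print(len(validated))
```

With ten targets and two low-scoring decoys, a 15% target FDR is reached after discarding a single decoy, while a 1% target requires discarding both. Note that the sketch returns matches in score order and keeps any decoys that score above the cutoff.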