pyXLMS.transform package
Submodules
pyXLMS.transform.aggregate module
- pyXLMS.transform.aggregate.aggregate(
- csms: List[Dict[str, Any]],
- by: Literal['peptide', 'protein'] = 'peptide',
- score: Literal['higher_better', 'lower_better'] = 'higher_better',
Aggregate crosslink-spectrum-matches to crosslinks.
Aggregates a list of crosslink-spectrum-matches to unique crosslinks. If by = "peptide", a crosslink is considered unique if there is no other crosslink with the same peptide sequence and crosslink position; otherwise it is considered unique if there is no other crosslink with the same protein crosslink position (residue pair). If more than one crosslink exists per peptide sequence/residue pair, the one with the best score is kept and the rest are filtered out. If crosslink-spectrum-matches without scores are provided, the crosslink of the first corresponding crosslink-spectrum-match in the list is kept instead.
- Parameters:
csms (list of dict of str, any) – A list of crosslink-spectrum-matches.
by (str, one of "peptide" or "protein", default = "peptide") – Whether the peptide or the protein crosslink position should be used to determine if a crosslink is unique. If the protein crosslink position is not available for all crosslink-spectrum-matches, a ValueError will be raised. Make sure that all crosslink-spectrum-matches have the _proteins and _proteins_crosslink_positions fields set. If this is not already done by the parser, it can be achieved with transform.reannotate_positions().
score (str, one of "higher_better" or "lower_better", default = "higher_better") – Whether a higher score or a lower score is considered better.
- Returns:
A list of aggregated, unique crosslinks.
- Return type:
list of dict of str, any
Warning
Aggregation will not conserve false discovery rate (FDR)! Aggregating crosslink-spectrum-matches that are validated for 1% FDR will not result in crosslinks validated for 1% FDR! Aggregated crosslinks should be validated with either external tools or with the built-in transform.validate()!
- Raises:
TypeError – If a wrong data type is provided.
TypeError – If parameter by is not one of ‘peptide’ or ‘protein’.
TypeError – If parameter score is not one of ‘higher_better’ or ‘lower_better’.
ValueError – If parameter by is set to ‘protein’ but protein crosslink positions are not available.
Examples
>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import aggregate
>>> pr = read("data/_test/aggregate/csms.txt", engine="custom", crosslinker="DSS")
>>> len(pr["crosslink-spectrum-matches"])
10
>>> aggregate_peptide = aggregate(pr["crosslink-spectrum-matches"], by="peptide")
>>> len(aggregate_peptide)
3

>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import aggregate
>>> pr = read("data/_test/aggregate/csms.txt", engine="custom", crosslinker="DSS")
>>> len(pr["crosslink-spectrum-matches"])
10
>>> aggregate_protein = aggregate(pr["crosslink-spectrum-matches"], by="protein")
>>> len(aggregate_protein)
2
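The keep-the-best-score rule described for by = "peptide" can be illustrated with plain dictionaries. This is a minimal sketch and not the pyXLMS implementation: the helpers `peptide_key` and `keep_best` and the toy records are hypothetical, and only the field names and the `score` option values are taken from this page.

```python
from typing import Any, Dict, Hashable, List, Tuple


def peptide_key(item: Dict[str, Any]) -> Tuple[Hashable, ...]:
    # Hypothetical grouping key: both peptide sequences plus their crosslink positions.
    return (
        item["alpha_peptide"],
        item["alpha_peptide_crosslink_position"],
        item["beta_peptide"],
        item["beta_peptide_crosslink_position"],
    )


def keep_best(items: List[Dict[str, Any]], score: str = "higher_better") -> List[Dict[str, Any]]:
    best: Dict[Tuple[Hashable, ...], Dict[str, Any]] = {}
    for item in items:
        key = peptide_key(item)
        kept = best.get(key)
        if kept is None:
            best[key] = item  # first match seen for this key
            continue
        if item["score"] is None or kept["score"] is None:
            continue  # without scores, the first match stays
        if score == "higher_better":
            is_better = item["score"] > kept["score"]
        else:
            is_better = item["score"] < kept["score"]
        if is_better:
            best[key] = item
    return list(best.values())


def toy(alpha: str, pos_a: int, beta: str, pos_b: int, score: float) -> Dict[str, Any]:
    # Toy record carrying only the fields this sketch needs.
    return {
        "alpha_peptide": alpha,
        "alpha_peptide_crosslink_position": pos_a,
        "beta_peptide": beta,
        "beta_peptide_crosslink_position": pos_b,
        "score": score,
    }


csms = [
    toy("KPEP", 1, "PEKP", 3, 10.0),
    toy("KPEP", 1, "PEKP", 3, 25.0),
    toy("PEPK", 4, "PKEP", 2, 5.0),
]
best_high = keep_best(csms)
best_low = keep_best(csms, score="lower_better")
```

Grouping by protein would only change the key function, for example to a tuple built from the protein crosslink positions.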
- pyXLMS.transform.aggregate.unique(
- data: List[Dict[str, Any]] | Dict[str, Any],
- by: Literal['peptide', 'protein'] = 'peptide',
- score: Literal['higher_better', 'lower_better'] = 'higher_better',
Filter for unique crosslinks or crosslink-spectrum-matches.
Filters for unique crosslinks from a list of non-unique crosslinks. If by = "peptide", a crosslink is considered unique if there is no other crosslink with the same peptide sequence and crosslink position; otherwise it is considered unique if there is no other crosslink with the same protein crosslink position (residue pair). If more than one crosslink exists per peptide sequence/residue pair, the one with the best score is kept and the rest are filtered out. If crosslinks without scores are provided, the first crosslink in the list is kept instead.
or
Filters for unique crosslink-spectrum-matches from a list of non-unique crosslink-spectrum-matches. A crosslink-spectrum-match is considered unique if there is no other crosslink-spectrum-match from the same spectrum file and with the same scan number. If more than one crosslink-spectrum-match exists per spectrum file and scan number, the one with the best score is kept and the rest are filtered out. If crosslink-spectrum-matches without scores are provided, the first crosslink-spectrum-match in the list is kept instead.
- Parameters:
data (dict of str, any, or list of dict of str, any) – A list of crosslink-spectrum-matches or crosslinks to filter, or a parser_result.
by (str, one of "peptide" or "protein", default = "peptide") – Whether the peptide or the protein crosslink position should be used to determine if a crosslink is unique. Only affects filtering for unique crosslinks and not crosslink-spectrum-matches. If the protein crosslink position is not available for all crosslinks, a ValueError will be raised. Make sure that all crosslinks have the _proteins and _proteins_crosslink_positions fields set. If this is not already done by the parser, it can be achieved with transform.reannotate_positions().
score (str, one of "higher_better" or "lower_better", default = "higher_better") – Whether a higher score or a lower score is considered better.
- Returns:
If a list of crosslink-spectrum-matches or crosslinks was provided, a list of unique crosslink-spectrum-matches or crosslinks is returned. If a parser_result was provided, a parser_result with unique crosslink-spectrum-matches and/or unique crosslinks will be returned.
- Return type:
list of dict of str, any, or dict of str, any
- Raises:
TypeError – If a wrong data type is provided.
TypeError – If parameter by is not one of ‘peptide’ or ‘protein’.
TypeError – If parameter score is not one of ‘higher_better’ or ‘lower_better’.
ValueError – If parameter by is set to ‘protein’ but protein crosslink positions are not available.
Examples
>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import unique
>>> pr = read(["data/_test/aggregate/csms.txt", "data/_test/aggregate/xls.txt"], engine="custom", crosslinker="DSS")
>>> len(pr["crosslink-spectrum-matches"])
10
>>> len(pr["crosslinks"])
10
>>> unique_peptide = unique(pr, by="peptide")
>>> len(unique_peptide["crosslink-spectrum-matches"])
5
>>> len(unique_peptide["crosslinks"])
3

>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import unique
>>> pr = read(["data/_test/aggregate/csms.txt", "data/_test/aggregate/xls.txt"], engine="custom", crosslinker="DSS")
>>> len(pr["crosslink-spectrum-matches"])
10
>>> len(pr["crosslinks"])
10
>>> unique_protein = unique(pr, by="protein")
>>> len(unique_protein["crosslink-spectrum-matches"])
5
>>> len(unique_protein["crosslinks"])
2
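For crosslink-spectrum-matches, uniqueness is defined per spectrum file and scan number. The following sketch shows that rule on toy dictionaries; the function `unique_csms_sketch` is a hypothetical stand-in, not the library code, and it assumes each record carries the `spectrum_file`, `scan_nr`, and `score` fields named on this page.

```python
from typing import Any, Dict, List, Tuple


def unique_csms_sketch(csms: List[Dict[str, Any]], score: str = "higher_better") -> List[Dict[str, Any]]:
    best: Dict[Tuple[str, int], Dict[str, Any]] = {}
    for csm in csms:
        key = (csm["spectrum_file"], csm["scan_nr"])
        kept = best.setdefault(key, csm)  # stores csm if the key is new
        if kept is csm or csm["score"] is None or kept["score"] is None:
            continue  # new key, or no scores to compare: the first match stays
        if score == "higher_better":
            is_better = csm["score"] > kept["score"]
        else:
            is_better = csm["score"] < kept["score"]
        if is_better:
            best[key] = csm
    return list(best.values())


toy_csms = [
    {"spectrum_file": "RUN_1", "scan_nr": 1, "score": 3.0},
    {"spectrum_file": "RUN_1", "scan_nr": 1, "score": 7.0},
    {"spectrum_file": "RUN_1", "scan_nr": 2, "score": 1.0},
    {"spectrum_file": "RUN_2", "scan_nr": 1, "score": None},
    {"spectrum_file": "RUN_2", "scan_nr": 1, "score": 9.0},
]
unique_scores = [csm["score"] for csm in unique_csms_sketch(toy_csms)]
```

The unscored RUN_2 match wins over the later scored one because the sketch follows the documented first-in-list rule whenever a score is missing. How pyXLMS treats a mix of scored and unscored matches is not stated here, so that detail is an assumption.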
pyXLMS.transform.filter module
- pyXLMS.transform.filter.filter_crosslink_type(
- data: List[Dict[str, Any]],
Separate crosslinks and crosslink-spectrum-matches by their crosslink type.
Gets all crosslinks or crosslink-spectrum-matches depending on crosslink type. Will separate based on if a crosslink or crosslink-spectrum-match is of type “intra” or “inter” crosslink.
- Parameters:
data (list of dict of str, any) – A list of pyXLMS crosslinks or crosslink-spectrum-matches.
- Returns:
Returns a dictionary with key Intra which contains all crosslinks or crosslink-spectrum-matches with crosslink type = “intra”, and key Inter which contains all crosslinks or crosslink-spectrum-matches with crosslink type = “inter”.
- Return type:
dict of str, list of dict
- Raises:
TypeError – If an unsupported data type is provided.
Examples
>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import filter_crosslink_type
>>> result = read("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_CSMs.xlsx", engine="MS Annika", crosslinker="DSS")
>>> crosslink_type_filtered_csms = filter_crosslink_type(result["crosslink-spectrum-matches"])
>>> len(crosslink_type_filtered_csms["Intra"])
803
>>> len(crosslink_type_filtered_csms["Inter"])
23

>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import filter_crosslink_type
>>> result = read("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_Crosslinks.xlsx", engine="MS Annika", crosslinker="DSS")
>>> crosslink_type_filtered_crosslinks = filter_crosslink_type(result["crosslinks"])
>>> len(crosslink_type_filtered_crosslinks["Intra"])
279
>>> len(crosslink_type_filtered_crosslinks["Inter"])
21
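The shape of the returned dictionary can be reproduced on toy data. The sketch below is only an illustration under the assumption that every record has a `crosslink_type` field holding "intra" or "inter"; `split_by_crosslink_type` is a hypothetical name.

```python
from typing import Any, Dict, List


def split_by_crosslink_type(data: List[Dict[str, Any]]) -> Dict[str, List[Dict[str, Any]]]:
    result: Dict[str, List[Dict[str, Any]]] = {"Intra": [], "Inter": []}
    for item in data:
        # Anything that is not "intra" is put into the "Inter" bucket in this sketch.
        key = "Intra" if item["crosslink_type"] == "intra" else "Inter"
        result[key].append(item)
    return result


toy_links = [
    {"id": 1, "crosslink_type": "intra"},
    {"id": 2, "crosslink_type": "inter"},
    {"id": 3, "crosslink_type": "intra"},
]
by_type = split_by_crosslink_type(toy_links)
```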
- pyXLMS.transform.filter.filter_proteins(
- data: List[Dict[str, Any]],
- proteins: Set[str] | List[str],
Get all crosslinks or crosslink-spectrum-matches originating from proteins of interest.
Gets all crosslinks or crosslink-spectrum-matches originating from a list of proteins of interest and returns a list of crosslinks or crosslink-spectrum-matches where both peptides come from a protein of interest and a list of crosslinks or crosslink-spectrum-matches where one of the peptides comes from a protein of interest.
- Parameters:
data (list of dict of str, any) – A list of pyXLMS crosslinks or crosslink-spectrum-matches.
proteins (set of str, or list of str) – A set of protein accessions of interest.
- Returns:
Returns a dictionary with key Proteins which contains the list of proteins of interest, key Both which contains all crosslinks or crosslink-spectrum-matches where both peptides originate from a protein of interest, and key One which contains all crosslinks or crosslink-spectrum-matches where one of the two peptides originates from a protein of interest.
- Return type:
dict
- Raises:
TypeError – If an unsupported data type is provided.
Examples
>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import filter_proteins
>>> result = read("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_CSMs.xlsx", engine="MS Annika", crosslinker="DSS")
>>> proteins_csms = filter_proteins(result["crosslink-spectrum-matches"], ["Cas9"])
>>> proteins_csms["Proteins"]
['Cas9']
>>> len(proteins_csms["Both"])
798
>>> len(proteins_csms["One"])
23

>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import filter_proteins
>>> result = read("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_Crosslinks.xlsx", engine="MS Annika", crosslinker="DSS")
>>> proteins_xls = filter_proteins(result["crosslinks"], ["Cas9"])
>>> proteins_xls["Proteins"]
['Cas9']
>>> len(proteins_xls["Both"])
274
>>> len(proteins_xls["One"])
21
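The Both/One split can be sketched with set intersections. This is an illustration, not the library code: `split_by_proteins` is a hypothetical helper, and it reads "one of the two peptides" as exactly one, so that the two lists do not overlap. That reading is an assumption.

```python
from typing import Any, Dict, Iterable, List


def split_by_proteins(data: List[Dict[str, Any]], proteins: Iterable[str]) -> Dict[str, Any]:
    wanted = set(proteins)
    result: Dict[str, Any] = {"Proteins": sorted(wanted), "Both": [], "One": []}
    for item in data:
        # A peptide counts as a hit if any of its proteins is a protein of interest.
        alpha_hit = bool(wanted.intersection(item["alpha_proteins"] or []))
        beta_hit = bool(wanted.intersection(item["beta_proteins"] or []))
        if alpha_hit and beta_hit:
            result["Both"].append(item)
        elif alpha_hit or beta_hit:
            result["One"].append(item)
    return result


toy_links = [
    {"id": 1, "alpha_proteins": ["Cas9"], "beta_proteins": ["Cas9"]},
    {"id": 2, "alpha_proteins": ["Cas9"], "beta_proteins": ["P12345"]},
    {"id": 3, "alpha_proteins": ["P12345"], "beta_proteins": ["P67890"]},
]
by_protein = split_by_proteins(toy_links, ["Cas9"])
```

Records that match no protein of interest, such as the third toy record, appear in neither list.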
- pyXLMS.transform.filter.filter_target_decoy(
- data: List[Dict[str, Any]],
Separate crosslinks or crosslink-spectrum-matches based on target and decoy matches.
Separates crosslinks or crosslink-spectrum-matches based on whether both peptides match the target database, both match the decoy database, or one matches the target database and the other the decoy database. The first we denote as “Target-Target” or “TT” matches, the second as “Decoy-Decoy” or “DD” matches, and the third as “Target-Decoy” or “TD” matches.
- Parameters:
data (list of dict of str, any) – A list of pyXLMS crosslinks or crosslink-spectrum-matches.
- Returns:
Returns a dictionary with key Target-Target which contains all TT matches, key Target-Decoy which contains all TD matches, and key Decoy-Decoy which contains all DD matches.
- Return type:
dict
- Raises:
TypeError – If an unsupported data type is provided.
Examples
>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import filter_target_decoy
>>> result = read("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_CSMs.xlsx", engine="MS Annika", crosslinker="DSS")
>>> target_and_decoys = filter_target_decoy(result["crosslink-spectrum-matches"])
>>> len(target_and_decoys["Target-Target"])
786
>>> len(target_and_decoys["Target-Decoy"])
39
>>> len(target_and_decoys["Decoy-Decoy"])
1

>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import filter_target_decoy
>>> result = read("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_Crosslinks.xlsx", engine="MS Annika", crosslinker="DSS")
>>> target_and_decoys = filter_target_decoy(result["crosslinks"])
>>> len(target_and_decoys["Target-Target"])
265
>>> len(target_and_decoys["Target-Decoy"])
0
>>> len(target_and_decoys["Decoy-Decoy"])
35
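The TT/TD/DD classification follows directly from the two decoy flags. The sketch below assumes boolean `alpha_decoy` and `beta_decoy` fields as named elsewhere on this page; `split_target_decoy` is a hypothetical helper and not the pyXLMS implementation.

```python
from typing import Any, Dict, List

LABELS = ("Target-Target", "Target-Decoy", "Decoy-Decoy")


def split_target_decoy(data: List[Dict[str, Any]]) -> Dict[str, List[Dict[str, Any]]]:
    result: Dict[str, List[Dict[str, Any]]] = {label: [] for label in LABELS}
    for item in data:
        # Count how many of the two peptides matched the decoy database: 0, 1 or 2.
        n_decoys = int(bool(item["alpha_decoy"])) + int(bool(item["beta_decoy"]))
        result[LABELS[n_decoys]].append(item)
    return result


toy_matches = [
    {"id": 1, "alpha_decoy": False, "beta_decoy": False},
    {"id": 2, "alpha_decoy": True, "beta_decoy": False},
    {"id": 3, "alpha_decoy": False, "beta_decoy": True},
    {"id": 4, "alpha_decoy": True, "beta_decoy": True},
]
by_label = split_target_decoy(toy_matches)
targets = by_label["Target-Target"]  # keeps only the target-target matches
```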
pyXLMS.transform.reannotate_positions module
- pyXLMS.transform.reannotate_positions.fasta_title_to_accession(title: str) → str
Parses the protein accession from a UniProt-like title.
- Parameters:
title (str) – Fasta title/header.
- Returns:
The protein accession parsed from the title. If parsing was unsuccessful the full title is returned.
- Return type:
str
Examples
>>> from pyXLMS.transform import fasta_title_to_accession
>>> title = "sp|A0A087X1C5|CP2D7_HUMAN Putative cytochrome P450 2D7 OS=Homo sapiens OX=9606 GN=CYP2D7 PE=5 SV=1"
>>> fasta_title_to_accession(title)
'A0A087X1C5'

>>> from pyXLMS.transform import fasta_title_to_accession
>>> title = "Cas9"
>>> fasta_title_to_accession(title)
'Cas9'
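A function with the same `str -> str` shape can be written for other header layouts and passed as the `title_to_accession` argument of reannotate_positions(). The sketch below is a hypothetical parser for pipe-separated titles. It is not the library's own parsing logic, which may handle more cases.

```python
def pipe_title_to_accession(title: str) -> str:
    """Return the second pipe-separated field, or the whole title as a fallback."""
    parts = title.strip().split("|")
    # UniProt-like titles look like "db|ACCESSION|ENTRY_NAME description".
    if len(parts) >= 3 and parts[1].strip():
        return parts[1].strip()
    return title.strip()


uniprot_title = "sp|A0A087X1C5|CP2D7_HUMAN Putative cytochrome P450 2D7 OS=Homo sapiens OX=9606 GN=CYP2D7 PE=5 SV=1"
accession = pipe_title_to_accession(uniprot_title)
plain = pipe_title_to_accession("Cas9")
```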
- pyXLMS.transform.reannotate_positions.reannotate_positions(
- data: List[Dict[str, Any]] | Dict[str, Any],
- fasta: str | BinaryIO,
- title_to_accession: Callable[[str], str] | None = None,
Reannotates protein crosslink positions for a given fasta file.
Reannotates the protein crosslink positions and peptide positions of the given cross-linked peptide pairs using the specified fasta file. Takes a list of crosslink-spectrum-matches or crosslinks, or a parser_result as input.
- Parameters:
data (list of dict of str, any, or dict of str, any) – A list of crosslink-spectrum-matches or crosslinks to annotate, or a parser_result.
fasta (str, or file stream) – The name/path of the fasta file containing protein sequences or a file-like object/stream.
title_to_accession (callable, or None, default = None) – A function that parses the protein accession from the fasta title/header. If None (default) the function fasta_title_to_accession is used.
- Returns:
If a list of crosslink-spectrum-matches or crosslinks was provided, a list of annotated crosslink-spectrum-matches or crosslinks is returned. If a parser_result was provided, an annotated parser_result will be returned.
- Return type:
list of dict of str, any, or dict of str, any
- Raises:
TypeError – If a wrong data type is provided.
Examples
>>> from pyXLMS.data import create_crosslink_min
>>> from pyXLMS.transform import reannotate_positions
>>> xls = [create_crosslink_min("ADANLDK", 7, "GNTDRHSIK", 9)]
>>> xls = reannotate_positions(xls, "data/_fasta/Cas9_plus10.fasta")
>>> xls[0]["alpha_proteins"]
['Cas9']
>>> xls[0]["alpha_proteins_crosslink_positions"]
[1293]
>>> xls[0]["beta_proteins"]
['Cas9']
>>> xls[0]["beta_proteins_crosslink_positions"]
[48]
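Conceptually, reannotation maps a peptide-level crosslink position onto a protein sequence. The sketch below shows one way to do that with plain string search. The function name, the toy sequence, and the choice to search for every (possibly overlapping) occurrence are assumptions for illustration and may differ from what pyXLMS does internally.

```python
from typing import List


def protein_crosslink_positions(peptide: str, crosslink_position: int, protein_sequence: str) -> List[int]:
    """Return 1-based protein positions of the crosslinked residue for every peptide occurrence."""
    positions = []
    start = protein_sequence.find(peptide)  # 0-based index of the first occurrence, or -1
    while start != -1:
        # 0-based peptide start + 1-based position within the peptide = 1-based protein position
        positions.append(start + crosslink_position)
        start = protein_sequence.find(peptide, start + 1)
    return positions


toy_protein = "MKPEPTIDEAAKPEPTIDER"  # hypothetical sequence containing the peptide twice
found = protein_crosslink_positions("KPEPTIDE", 1, toy_protein)
missing = protein_crosslink_positions("ADANLDK", 7, toy_protein)
```

A peptide that occurs more than once yields several positions, which is consistent with the list-valued `_proteins_crosslink_positions` fields in the example output.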
pyXLMS.transform.summary module
- pyXLMS.transform.summary.summary(
- data: List[Dict[str, Any]] | Dict[str, Any],
Extracts summary stats from a list of crosslinks or crosslink-spectrum-matches, or a parser_result.
Extracts summary statistics from a list of crosslinks or crosslink-spectrum-matches, or a parser_result. The statistics depend on the supplied data type. If a list of crosslinks is supplied, a dictionary with the following statistics and keys is returned:
Number of crosslinks
Number of unique crosslinks by peptide
Number of unique crosslinks by protein
Number of intra crosslinks
Number of inter crosslinks
Number of target-target crosslinks
Number of target-decoy crosslinks
Number of decoy-decoy crosslinks
Minimum crosslink score
Maximum crosslink score
If a list of crosslink-spectrum-matches is supplied, a dictionary with the following statistics and keys is returned:
Number of CSMs
Number of unique CSMs
Number of intra CSMs
Number of inter CSMs
Number of target-target CSMs
Number of target-decoy CSMs
Number of decoy-decoy CSMs
Minimum CSM score
Maximum CSM score
If a parser_result is supplied, the statistics for whichever data it contains are returned. A parser_result that only contains crosslinks yields only crosslink statistics, and a parser_result that only contains crosslink-spectrum-matches yields only crosslink-spectrum-match statistics. If the parser_result contains both, the two dictionaries are merged and a single dictionary is returned that contains the keys for both crosslinks and crosslink-spectrum-matches.
Statistics are also printed to stdout.
- Parameters:
data (list of dict of str, any, or dict of str, any) – A list of crosslinks or crosslink-spectrum-matches, or a parser_result.
- Returns:
A dictionary with summary statistics.
- Return type:
dict of str, float
Examples
>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import summary
>>> pr = read("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_CSMs.xlsx", engine="MS Annika", crosslinker="DSS")
>>> csms = pr["crosslink-spectrum-matches"]
>>> stats = summary(csms)
Number of CSMs: 826.0
Number of unique CSMs: 826.0
Number of intra CSMs: 803.0
Number of inter CSMs: 23.0
Number of target-target CSMs: 786.0
Number of target-decoy CSMs: 39.0
Number of decoy-decoy CSMs: 1.0
Minimum CSM score: 1.11
Maximum CSM score: 452.99

>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import summary
>>> pr = read("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_Crosslinks.xlsx", engine="MS Annika", crosslinker="DSS")
>>> stats = summary(pr)
Number of crosslinks: 300.0
Number of unique crosslinks by peptide: 300.0
Number of unique crosslinks by protein: 298.0
Number of intra crosslinks: 279.0
Number of inter crosslinks: 21.0
Number of target-target crosslinks: 265.0
Number of target-decoy crosslinks: 0.0
Number of decoy-decoy crosslinks: 35.0
Minimum crosslink score: 1.11
Maximum crosslink score: 452.99
pyXLMS.transform.targets_only module
- pyXLMS.transform.targets_only.targets_only(
- data: List[Dict[str, Any]] | Dict[str, Any],
Get target crosslinks or crosslink-spectrum-matches.
Get target crosslinks or crosslink-spectrum-matches from a list of target and decoy crosslinks or crosslink-spectrum-matches, or a parser_result. This effectively filters out any target-decoy and decoy-decoy matches and is essentially a convenience wrapper for transform.filter_target_decoy()["Target-Target"].
- Parameters:
data (dict of str, any, or list of dict of str, any) – A list of crosslink-spectrum-matches or crosslinks, or a parser_result.
- Returns:
If a list of crosslink-spectrum-matches or crosslinks was provided, a list of target crosslink-spectrum-matches or crosslinks is returned. If a parser_result was provided, a parser_result with target crosslink-spectrum-matches and/or target crosslinks will be returned.
- Return type:
list of dict of str, any, or dict of str, any
- Raises:
TypeError – If a wrong data type is provided.
RuntimeError – If no target crosslinks or crosslink-spectrum-matches were found.
Examples
>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import targets_only
>>> result = read("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_CSMs.xlsx", engine="MS Annika", crosslinker="DSS")
>>> targets = targets_only(result["crosslink-spectrum-matches"])
>>> len(targets)
786

>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import targets_only
>>> result = read("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_Crosslinks.xlsx", engine="MS Annika", crosslinker="DSS")
>>> targets = targets_only(result["crosslinks"])
>>> len(targets)
265

>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import targets_only
>>> result = read(["data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_CSMs.xlsx", "data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_Crosslinks.xlsx"], engine="MS Annika", crosslinker="DSS")
>>> result_targets = targets_only(result)
>>> len(result_targets["crosslink-spectrum-matches"])
786
>>> len(result_targets["crosslinks"])
265
pyXLMS.transform.to_dataframe module
- pyXLMS.transform.to_dataframe.to_dataframe(
- data: List[Dict[str, Any]],
Returns a pandas DataFrame of the given crosslinks or crosslink-spectrum-matches.
- Parameters:
data (list) – A list of crosslinks or crosslink-spectrum-matches as created by data.create_crosslink() or data.create_csm().
- Returns:
The pandas DataFrame created from the list of input crosslinks or crosslink-spectrum-matches. A full specification of the returned DataFrame can be found in the docs.
- Return type:
pandas.DataFrame
- Raises:
TypeError – If the list does not contain crosslinks or crosslink-spectrum-matches.
ValueError – If the list does not contain any objects.
Examples
>>> from pyXLMS.transform import to_dataframe
>>> # assume that crosslinks is a list of crosslinks created by data.create_crosslink()
>>> crosslink_dataframe = to_dataframe(crosslinks)
>>> # assume csms is a list of crosslink-spectrum-matches created by data.create_csm()
>>> csm_dataframe = to_dataframe(csms)
pyXLMS.transform.to_proforma module
- pyXLMS.transform.to_proforma.to_proforma(
- data: Dict[str, Any] | List[Dict[str, Any]],
- crosslinker: str | float | None = None,
Returns the Proforma string for a single crosslink or crosslink-spectrum-match, or for a list of crosslinks or crosslink-spectrum-matches.
- Parameters:
data (dict of str, any, or list of dict of str, any) – A pyXLMS crosslink object, e.g. see data.create_crosslink(), or a pyXLMS crosslink-spectrum-match object, e.g. see data.create_csm(). Alternatively, a list of crosslinks or crosslink-spectrum-matches can be provided.
crosslinker (str, or float, or None, default = None) – Optional name or mass of the crosslink reagent. If the name is given, it should be a valid name from XLMOD. If the crosslink modification is contained in the crosslink-spectrum-match object this parameter has no effect.
- Returns:
The Proforma string of the crosslink or crosslink-spectrum-match. If a list was provided a list containing all Proforma strings is returned.
- Return type:
str, or list of str
- Raises:
TypeError – If an unsupported data type is provided.
Notes
Modifications with unknown mass are skipped.
If no modifications are given, only the crosslink modification will be encoded in the Proforma.
If no modifications are given and no crosslinker is given, the unmodified peptide Proforma will be returned.
Examples
>>> from pyXLMS.data import create_crosslink_min
>>> from pyXLMS.transform import to_proforma
>>> xl = create_crosslink_min("PEPKTIDE", 4, "KPEPTIDE", 1)
>>> to_proforma(xl)
'KPEPTIDE//PEPKTIDE'

>>> from pyXLMS.data import create_crosslink_min
>>> from pyXLMS.transform import to_proforma
>>> xl = create_crosslink_min("PEPKTIDE", 4, "KPEPTIDE", 1)
>>> to_proforma(xl, crosslinker="Xlink:DSSO")
'K[Xlink:DSSO]PEPTIDE//PEPK[Xlink:DSSO]TIDE'

>>> from pyXLMS.data import create_csm_min
>>> from pyXLMS.transform import to_proforma
>>> csm = create_csm_min("PEPKTIDE", 4, "KPEPTIDE", 1, "RUN_1", 1)
>>> to_proforma(csm)
'KPEPTIDE//PEPKTIDE'

>>> from pyXLMS.data import create_csm_min
>>> from pyXLMS.transform import to_proforma
>>> csm = create_csm_min("PEPKTIDE", 4, "KPEPTIDE", 1, "RUN_1", 1)
>>> to_proforma(csm, crosslinker="Xlink:DSSO")
'K[Xlink:DSSO]PEPTIDE//PEPK[Xlink:DSSO]TIDE'

>>> from pyXLMS.data import create_csm_min
>>> from pyXLMS.transform import to_proforma
>>> csm = create_csm_min("PEPKTIDE", 4, "KPMEPTIDE", 1, "RUN_1", 1, modifications_b={3:("Oxidation", 15.994915)})
>>> to_proforma(csm, crosslinker="Xlink:DSSO")
'K[Xlink:DSSO]PM[+15.994915]EPTIDE//PEPK[Xlink:DSSO]TIDE'

>>> from pyXLMS.data import create_csm_min
>>> from pyXLMS.transform import to_proforma
>>> csm = create_csm_min("PEPKTIDE", 4, "KPMEPTIDE", 1, "RUN_1", 1, modifications_b={3:("Oxidation", 15.994915)}, charge=3)
>>> to_proforma(csm, crosslinker="Xlink:DSSO")
'K[Xlink:DSSO]PM[+15.994915]EPTIDE//PEPK[Xlink:DSSO]TIDE/3'

>>> from pyXLMS.data import create_csm_min
>>> from pyXLMS.transform import to_proforma
>>> csm = create_csm_min("PEPKTIDE", 4, "KPMEPTIDE", 1, "RUN_1", 1, modifications_a={4:("DSSO", 158.00376)}, modifications_b={1:("DSSO", 158.00376), 3:("Oxidation", 15.994915)}, charge=3)
>>> to_proforma(csm)
'K[+158.00376]PM[+15.994915]EPTIDE//PEPK[+158.00376]TIDE/3'

>>> from pyXLMS.data import create_csm_min
>>> from pyXLMS.transform import to_proforma
>>> csm = create_csm_min("PEPKTIDE", 4, "KPMEPTIDE", 1, "RUN_1", 1, modifications_a={4:("DSSO", 158.00376)}, modifications_b={1:("DSSO", 158.00376), 3:("Oxidation", 15.994915)}, charge=3)
>>> to_proforma(csm, crosslinker="Xlink:DSSO")
'K[+158.00376]PM[+15.994915]EPTIDE//PEPK[+158.00376]TIDE/3'
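The string layout in the examples above (bracketed tags after the modified residue, `//` between the two peptides, optional `/charge` suffix) can be reproduced with a few lines of string handling. This is a simplified sketch with hypothetical helper names. It takes ready-made tag strings, ignores terminal modifications, and does not reproduce how pyXLMS orders the peptides or chooses between names and masses.

```python
from typing import Dict, Optional


def annotate_peptide(peptide: str, tags: Dict[int, str]) -> str:
    """Insert "[tag]" after each residue whose 1-based position is in tags."""
    parts = []
    for position, residue in enumerate(peptide, start=1):
        parts.append(residue)
        if position in tags:
            parts.append(f"[{tags[position]}]")
    return "".join(parts)


def crosslink_string_sketch(
    first_peptide: str,
    first_tags: Dict[int, str],
    second_peptide: str,
    second_tags: Dict[int, str],
    charge: Optional[int] = None,
) -> str:
    text = annotate_peptide(first_peptide, first_tags) + "//" + annotate_peptide(second_peptide, second_tags)
    if charge is not None:
        text += f"/{charge}"  # precursor charge suffix
    return text


example = crosslink_string_sketch(
    "KPMEPTIDE", {1: "Xlink:DSSO", 3: "+15.994915"},
    "PEPKTIDE", {4: "Xlink:DSSO"},
    charge=3,
)
```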
pyXLMS.transform.util module
- pyXLMS.transform.util.assert_data_type_same(data_list: List[Dict[str, Any]]) → bool
Checks that all data is of the same data type.
Verifies that all elements in the provided list are of the same data type.
- Parameters:
data_list (list of dict of str, any) – A list of dictionaries with the data_type key.
- Returns:
True if all elements are of the same data type, False otherwise.
- Return type:
bool
Examples
>>> from pyXLMS.transform import assert_data_type_same
>>> from pyXLMS import data
>>> data_list = [data.create_crosslink_min("PEPK", 4, "PKEP", 2), data.create_crosslink_min("KPEP", 1, "PEKP", 3)]
>>> assert_data_type_same(data_list)
True

>>> from pyXLMS.transform import assert_data_type_same
>>> from pyXLMS import data
>>> data_list = [data.create_crosslink_min("PEPK", 4, "PKEP", 2), data.create_csm_min("KPEP", 1, "PEKP", 3, "RUN_1", 1)]
>>> assert_data_type_same(data_list)
False
- pyXLMS.transform.util.get_available_keys(
- data_list: List[Dict[str, Any]],
Checks which data is available from a list of crosslinks or crosslink-spectrum-matches.
Verifies which data fields have been set for all crosslinks or crosslink-spectrum-matches in the given list. Will return a dictionary structured the same as a crosslink or crosslink-spectrum-match, but instead of the data it will return either True or False, depending if the field was set or not.
- Parameters:
data_list (list of dict of str, any) – A list of crosslinks or crosslink-spectrum-matches.
- Returns:
If a list of crosslinks was provided, a dictionary with the following keys will be returned, where the value of each key denotes if the data field is available for all crosslinks in data_list. Keys: data_type, completeness, alpha_peptide, alpha_peptide_crosslink_position, alpha_proteins, alpha_proteins_crosslink_positions, alpha_decoy, beta_peptide, beta_peptide_crosslink_position, beta_proteins, beta_proteins_crosslink_positions, beta_decoy, crosslink_type, score, and additional_information.
If a list of crosslink-spectrum-matches was provided, a dictionary with the following keys will be returned, where the value of each key denotes if the data field is available for all crosslink-spectrum-matches in data_list. Keys: data_type, completeness, alpha_peptide, alpha_modifications, alpha_peptide_crosslink_position, alpha_proteins, alpha_proteins_crosslink_positions, alpha_proteins_peptide_positions, alpha_score, alpha_decoy, beta_peptide, beta_modifications, beta_peptide_crosslink_position, beta_proteins, beta_proteins_crosslink_positions, beta_proteins_peptide_positions, beta_score, beta_decoy, crosslink_type, score, spectrum_file, scan_nr, retention_time, ion_mobility, and additional_information.
- Return type:
dict of str, bool
- Raises:
TypeError – If not all elements in data_list are of the same data type.
TypeError – If one or more elements in the list are of an unsupported data type.
Examples
>>> from pyXLMS.transform import get_available_keys
>>> from pyXLMS import data
>>> data_list = [data.create_crosslink_min("PEPK", 4, "PKEP", 2), data.create_crosslink_min("KPEP", 1, "PEKP", 3)]
>>> available_keys = get_available_keys(data_list)
>>> available_keys["alpha_peptide"]
True
>>> available_keys["score"]
False
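The "available for all elements" semantics can be sketched with `all()`. The helper below is hypothetical and assumes that an unset field is stored as `None`, which this page does not state explicitly.

```python
from typing import Any, Dict, List


def available_keys_sketch(data_list: List[Dict[str, Any]]) -> Dict[str, bool]:
    # Use the keys of the first element as the template for the result.
    keys = list(data_list[0].keys())
    # A key counts as available only if no element has it missing or set to None.
    return {key: all(item.get(key) is not None for item in data_list) for key in keys}


toy_list = [
    {"alpha_peptide": "PEPK", "beta_peptide": "PKEP", "score": None},
    {"alpha_peptide": "KPEP", "beta_peptide": "PEKP", "score": 3.2},
]
available = available_keys_sketch(toy_list)
```

Because a single missing value makes a key unavailable, `score` is reported as `False` here even though the second toy record has one.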
- pyXLMS.transform.util.modifications_to_str(
- modifications: Dict[int, Tuple[str, float]] | None,
Returns the string representation of a modifications dictionary.
- Parameters:
modifications (dict of int, tuple, or None) – The modifications of a peptide given as a dictionary that maps peptide position (1-based) to modification given as a tuple of modification name and modification delta mass. N-terminal modifications should be denoted with position 0. C-terminal modifications should be denoted with position len(peptide) + 1.
- Returns:
The string representation of the modifications (or None if no modification was provided).
- Return type:
str, or None
Examples
>>> from pyXLMS.transform import modifications_to_str
>>> modifications_to_str({1: ("Oxidation", 15.994915), 5: ("Carbamidomethyl", 57.021464)})
'(1:[Oxidation|15.994915]);(5:[Carbamidomethyl|57.021464])'
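The output format shown in the example, `(position:[name|mass])` entries joined by semicolons, can be reproduced as follows. This is a sketch with a hypothetical function name. Sorting by position and returning `None` only for a `None` input are assumptions about details this page does not spell out.

```python
from typing import Dict, Optional, Tuple


def modifications_to_str_sketch(modifications: Optional[Dict[int, Tuple[str, float]]]) -> Optional[str]:
    if modifications is None:
        return None
    entries = []
    for position in sorted(modifications):  # assumed ordering: ascending position
        name, mass = modifications[position]
        entries.append(f"({position}:[{name}|{mass}])")
    return ";".join(entries)


mods_text = modifications_to_str_sketch({5: ("Carbamidomethyl", 57.021464), 1: ("Oxidation", 15.994915)})
```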
pyXLMS.transform.validate module
- pyXLMS.transform.validate.validate(
- data: List[Dict[str, Any]] | Dict[str, Any],
- fdr: float = 0.01,
- formula: Literal['D/T', '(TD+DD)/TT', '(TD-DD)/TT'] = 'D/T',
- score: Literal['higher_better', 'lower_better'] = 'higher_better',
- separate_intra_inter: bool = False,
- ignore_missing_labels: bool = False,
Validate a list of crosslinks or crosslink-spectrum-matches, or a parser_result by estimating false discovery rate.
Validate a list of crosslinks or crosslink-spectrum-matches, or a parser_result by estimating false discovery rate (FDR) using the defined formula. Requires that “score”, “alpha_decoy” and “beta_decoy” fields are set for crosslinks and crosslink-spectrum-matches.
- Parameters:
data (list of dict of str, any, or dict of str, any) – A list of crosslink-spectrum-matches or crosslinks to validate, or a parser_result.
fdr (float, default = 0.01) – The target FDR, must be given as a real number between 0 and 1. The default of 0.01 corresponds to 1% FDR.
formula (str, one of "D/T", "(TD+DD)/TT", or "(TD-DD)/TT", default = "D/T") – Which formula to use to estimate FDR. D and DD denote decoy matches, T and TT denote target matches, and TD denotes target-decoy and decoy-target matches.
score (str, one of "higher_better" or "lower_better", default = "higher_better") – If a higher score is considered better, or a lower score is considered better.
separate_intra_inter (bool, default = False) – If FDR should be estimated separately for intra and inter matches.
ignore_missing_labels (bool, default = False) – If crosslinks and crosslink-spectrum-matches should be ignored if they don’t have target and decoy labels. By default an error is raised if any unlabelled data is encountered.
- Returns:
If a list of crosslink-spectrum-matches or crosslinks was provided, a list of validated crosslink-spectrum-matches or crosslinks is returned. If a parser_result was provided, a parser_result with validated crosslink-spectrum-matches and/or validated crosslinks will be returned.
- Return type:
list of dict of str, any, or dict of str, any
- Raises:
TypeError – If a wrong data type is provided.
TypeError – If parameter formula is not one of ‘D/T’, ‘(TD+DD)/TT’, or ‘(TD-DD)/TT’.
TypeError – If parameter score is not one of ‘higher_better’ or ‘lower_better’.
ValueError – If parameter fdr is outside of the supported range.
ValueError – If attribute ‘score’ is not available for any of the data.
ValueError – If attribute ‘alpha_decoy’ or ‘beta_decoy’ is not available for any of the data and parameter ignore_missing_labels is set to False.
ValueError – If the number of DD matches exceeds the number of TD matches for formula ‘(TD-DD)/TT’. FDR can not be estimated with the formula ‘(TD-DD)/TT’ in these cases.
Notes
Please note that progress bars will usually not complete when running this function. This is by design as it is not necessary to iterate over all scores to estimate FDR.
Examples
>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import validate
>>> pr = read("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_CSMs.xlsx", engine="MS Annika", crosslinker="DSS")
>>> csms = pr["crosslink-spectrum-matches"]
>>> len(csms)
826
>>> validated = validate(csms)
>>> len(validated)
705

>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import validate
>>> pr = read(["data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_CSMs.xlsx", "data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_Crosslinks.xlsx"], engine="MS Annika", crosslinker="DSS")
>>> len(pr["crosslink-spectrum-matches"])
826
>>> len(pr["crosslinks"])
300
>>> validated = validate(pr)
>>> len(validated["crosslink-spectrum-matches"])
705
>>> len(validated["crosslinks"])
226

>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import validate
>>> pr = read(["data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_CSMs.xlsx", "data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_Crosslinks.xlsx"], engine="MS Annika", crosslinker="DSS")
>>> len(pr["crosslink-spectrum-matches"])
826
>>> len(pr["crosslinks"])
300
>>> validated = validate(pr, fdr=0.05)
>>> len(validated["crosslink-spectrum-matches"])
825
>>> len(validated["crosslinks"])
260
Module contents
- pyXLMS.transform.aggregate(csms: List[Dict[str, Any]], by: Literal['peptide', 'protein'] = 'peptide', score: Literal['higher_better', 'lower_better'] = 'higher_better')
Aggregate crosslink-spectrum-matches to crosslinks.
Aggregates a list of crosslink-spectrum-matches to unique crosslinks. A crosslink is considered unique if there is no other crosslink with the same peptide sequence and crosslink position if by = "peptide", otherwise it is considered unique if there are no other crosslinks with the same protein crosslink position (residue pair). If more than one crosslink exists per peptide sequence/residue pair, the one with the better/best score is kept and the rest are filtered out. If crosslink-spectrum-matches without scores are provided, the crosslink of the first corresponding crosslink-spectrum-match in the list is kept instead.
- Parameters:
csms (list of dict of str, any) – A list of crosslink-spectrum-matches.
by (str, one of "peptide" or "protein", default = "peptide") – If peptide or protein crosslink position should be used for determining if a crosslink is unique. If protein crosslink position is not available for all crosslink-spectrum-matches a ValueError will be raised. Make sure that all crosslink-spectrum-matches have the _proteins and _proteins_crosslink_positions fields set. If this is not already done by the parser, this can be achieved with transform.reannotate_positions().
score (str, one of "higher_better" or "lower_better", default = "higher_better") – If a higher score is considered better, or a lower score is considered better.
- Returns:
A list of aggregated, unique crosslinks.
- Return type:
list of dict of str, any
Warning
Aggregation will not conserve false discovery rate (FDR)! Aggregating crosslink-spectrum-matches that are validated for 1% FDR will not result in crosslinks validated for 1% FDR! Aggregated crosslinks should be validated with either external tools or with the built-in transform.validate()!
- Raises:
TypeError – If a wrong data type is provided.
TypeError – If parameter by is not one of ‘peptide’ or ‘protein’.
TypeError – If parameter score is not one of ‘higher_better’ or ‘lower_better’.
ValueError – If parameter by is set to ‘protein’ but protein crosslink positions are not available.
Examples
>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import aggregate
>>> pr = read("data/_test/aggregate/csms.txt", engine="custom", crosslinker="DSS")
>>> len(pr["crosslink-spectrum-matches"])
10
>>> aggregate_peptide = aggregate(pr["crosslink-spectrum-matches"], by="peptide")
>>> len(aggregate_peptide)
3

>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import aggregate
>>> pr = read("data/_test/aggregate/csms.txt", engine="custom", crosslinker="DSS")
>>> len(pr["crosslink-spectrum-matches"])
10
>>> aggregate_protein = aggregate(pr["crosslink-spectrum-matches"], by="protein")
>>> len(aggregate_protein)
2
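The selection rule documented for by = "peptide" (one entry per peptide pair and crosslink position, the better score wins, the first entry wins when scores are missing) can be sketched in plain Python. This is only an illustration and not the pyXLMS implementation: the helper names best_per_peptide_pair and toy_csm and the toy data are hypothetical, and unlike aggregate() the sketch returns the selected input dictionaries instead of building crosslink objects.

```python
from typing import Any, Dict, List, Tuple


def best_per_peptide_pair(
    csms: List[Dict[str, Any]], higher_better: bool = True
) -> List[Dict[str, Any]]:
    """Hypothetical helper: keep one match per peptide pair and crosslink position."""
    best: Dict[Tuple[Any, ...], Dict[str, Any]] = {}
    for csm in csms:
        key = (
            csm["alpha_peptide"],
            csm["alpha_peptide_crosslink_position"],
            csm["beta_peptide"],
            csm["beta_peptide_crosslink_position"],
        )
        kept = best.get(key)
        if kept is None:
            best[key] = csm  # first match seen for this key
            continue
        new_score, old_score = csm.get("score"), kept.get("score")
        if new_score is None or old_score is None:
            continue  # nothing to compare: the first match stays
        is_better = new_score > old_score if higher_better else new_score < old_score
        if is_better:
            best[key] = csm
    return list(best.values())


def toy_csm(alpha: str, alpha_pos: int, beta: str, beta_pos: int, score: float) -> Dict[str, Any]:
    """Build a minimal toy dictionary (not a full pyXLMS object)."""
    return {
        "alpha_peptide": alpha,
        "alpha_peptide_crosslink_position": alpha_pos,
        "beta_peptide": beta,
        "beta_peptide_crosslink_position": beta_pos,
        "score": score,
    }


toy_csms = [
    toy_csm("PEPK", 4, "PKEP", 2, 10.0),
    toy_csm("PEPK", 4, "PKEP", 2, 25.0),
    toy_csm("KPEP", 1, "PEKP", 3, 5.0),
]
aggregated = best_per_peptide_pair(toy_csms)
aggregated_lower = best_per_peptide_pair(toy_csms, higher_better=False)
```

With by = "protein" the key would instead be built from the protein crosslink positions, which is why those fields have to be set for every crosslink-spectrum-match.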
- pyXLMS.transform.assert_data_type_same(data_list: List[Dict[str, Any]]) -> bool
Checks that all data is of the same data type.
Verifies that all elements in the provided list are of the same data type.
- Parameters:
data_list (list of dict of str, any) – A list of dictionaries with the data_type key.
- Returns:
If all elements are of the same data type.
- Return type:
bool
Examples
>>> from pyXLMS.transform import assert_data_type_same
>>> from pyXLMS import data
>>> data_list = [data.create_crosslink_min("PEPK", 4, "PKEP", 2), data.create_crosslink_min("KPEP", 1, "PEKP", 3)]
>>> assert_data_type_same(data_list)
True

>>> from pyXLMS.transform import assert_data_type_same
>>> from pyXLMS import data
>>> data_list = [data.create_crosslink_min("PEPK", 4, "PKEP", 2), data.create_csm_min("KPEP", 1, "PEKP", 3, "RUN_1", 1)]
>>> assert_data_type_same(data_list)
False
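Conceptually the check amounts to collecting the distinct data_type values. The following sketch assumes nothing beyond the documented data_type key; the helper name same_data_type and the data_type strings used in the toy data are placeholders, not values confirmed by this documentation.

```python
from typing import Any, Dict, List


def same_data_type(data_list: List[Dict[str, Any]]) -> bool:
    """Hypothetical helper: True if every element shares one data_type value."""
    return len({item["data_type"] for item in data_list}) <= 1


# The data_type strings below are placeholders chosen for this sketch.
crosslinks = [{"data_type": "crosslink"}, {"data_type": "crosslink"}]
mixed = [{"data_type": "crosslink"}, {"data_type": "crosslink-spectrum-match"}]
only_crosslinks = same_data_type(crosslinks)
mixed_types = same_data_type(mixed)
```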
- pyXLMS.transform.fasta_title_to_accession(title: str) -> str
Parses the protein accession from a UniProt-like title.
- Parameters:
title (str) – Fasta title/header.
- Returns:
The protein accession parsed from the title. If parsing was unsuccessful the full title is returned.
- Return type:
str
Examples
>>> from pyXLMS.transform import fasta_title_to_accession
>>> title = "sp|A0A087X1C5|CP2D7_HUMAN Putative cytochrome P450 2D7 OS=Homo sapiens OX=9606 GN=CYP2D7 PE=5 SV=1"
>>> fasta_title_to_accession(title)
'A0A087X1C5'

>>> from pyXLMS.transform import fasta_title_to_accession
>>> title = "Cas9"
>>> fasta_title_to_accession(title)
'Cas9'
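For fasta files whose headers need different handling, a custom parser with the same str to str shape can be written. The sketch below reproduces the two documented results for UniProt-like titles by splitting on the pipe character; the function name accession_from_title is hypothetical and the logic is an assumption about a reasonable approach, not the library's actual parsing code.

```python
def accession_from_title(title: str) -> str:
    """Hypothetical parser: take the second '|'-separated field, else the full title."""
    parts = title.split("|")
    if len(parts) >= 3 and parts[1].strip():
        return parts[1].strip()
    return title  # parsing was not possible, fall back to the full title


uniprot_title = (
    "sp|A0A087X1C5|CP2D7_HUMAN Putative cytochrome P450 2D7 "
    "OS=Homo sapiens OX=9606 GN=CYP2D7 PE=5 SV=1"
)
parsed_uniprot = accession_from_title(uniprot_title)
parsed_plain = accession_from_title("Cas9")
```

A function of this shape matches the Callable[[str], str] type that reannotate_positions() accepts for its title_to_accession parameter.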
- pyXLMS.transform.filter_crosslink_type(data: List[Dict[str, Any]])
Separate crosslinks and crosslink-spectrum-matches by their crosslink type.
Separates crosslinks or crosslink-spectrum-matches according to whether their crosslink type is “intra” or “inter”.
- Parameters:
data (list of dict of str, any) – A list of pyXLMS crosslinks or crosslink-spectrum-matches.
- Returns:
Returns a dictionary with key Intra which contains all crosslinks or crosslink-spectrum-matches with crosslink type = “intra”, and key Inter which contains all crosslinks or crosslink-spectrum-matches with crosslink type = “inter”.
- Return type:
dict of str, list of dict
- Raises:
TypeError – If an unsupported data type is provided.
Examples
>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import filter_crosslink_type
>>> result = read("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_CSMs.xlsx", engine="MS Annika", crosslinker="DSS")
>>> crosslink_type_filtered_csms = filter_crosslink_type(result["crosslink-spectrum-matches"])
>>> len(crosslink_type_filtered_csms["Intra"])
803
>>> len(crosslink_type_filtered_csms["Inter"])
23

>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import filter_crosslink_type
>>> result = read("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_Crosslinks.xlsx", engine="MS Annika", crosslinker="DSS")
>>> crosslink_type_filtered_crosslinks = filter_crosslink_type(result["crosslinks"])
>>> len(crosslink_type_filtered_crosslinks["Intra"])
279
>>> len(crosslink_type_filtered_crosslinks["Inter"])
21
- pyXLMS.transform.filter_proteins(data: List[Dict[str, Any]], proteins: Set[str] | List[str])
Get all crosslinks or crosslink-spectrum-matches originating from proteins of interest.
Gets all crosslinks or crosslink-spectrum-matches originating from a list of proteins of interest and returns a list of crosslinks or crosslink-spectrum-matches where both peptides come from a protein of interest and a list of crosslinks or crosslink-spectrum-matches where one of the peptides comes from a protein of interest.
- Parameters:
data (list of dict of str, any) – A list of pyXLMS crosslinks or crosslink-spectrum-matches.
proteins (set of str, or list of str) – A set of protein accessions of interest.
- Returns:
Returns a dictionary with key Proteins which contains the list of proteins of interest, key Both which contains all crosslinks or crosslink-spectrum-matches where both peptides are originating from a protein of interest, and key One which contains all crosslinks or crosslink-spectrum-matches where one of the two peptides is originating from a protein of interest.
- Return type:
dict
- Raises:
TypeError – If an unsupported data type is provided.
Examples
>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import filter_proteins
>>> result = read("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_CSMs.xlsx", engine="MS Annika", crosslinker="DSS")
>>> proteins_csms = filter_proteins(result["crosslink-spectrum-matches"], ["Cas9"])
>>> proteins_csms["Proteins"]
['Cas9']
>>> len(proteins_csms["Both"])
798
>>> len(proteins_csms["One"])
23

>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import filter_proteins
>>> result = read("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_Crosslinks.xlsx", engine="MS Annika", crosslinker="DSS")
>>> proteins_xls = filter_proteins(result["crosslinks"], ["Cas9"])
>>> proteins_xls["Proteins"]
['Cas9']
>>> len(proteins_xls["Both"])
274
>>> len(proteins_xls["One"])
21
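The Both/One split can be pictured with a small stand-alone sketch. It assumes that alpha_proteins and beta_proteins hold lists of accessions (or None when unset) and that One means exactly one of the two peptides maps to a protein of interest; the helper name split_by_proteins and the toy accessions are hypothetical, and this is not the library's code.

```python
from typing import Any, Dict, Iterable, List


def split_by_proteins(data: List[Dict[str, Any]], proteins: Iterable[str]) -> Dict[str, Any]:
    """Hypothetical helper mirroring the documented Proteins/Both/One keys."""
    protein_list = list(proteins)
    wanted = set(protein_list)
    result: Dict[str, Any] = {"Proteins": protein_list, "Both": [], "One": []}
    for item in data:
        # Unset protein fields are treated as empty lists (assumption).
        alpha_hit = any(p in wanted for p in (item.get("alpha_proteins") or []))
        beta_hit = any(p in wanted for p in (item.get("beta_proteins") or []))
        if alpha_hit and beta_hit:
            result["Both"].append(item)
        elif alpha_hit or beta_hit:
            result["One"].append(item)
    return result


toy_links = [
    {"alpha_proteins": ["Cas9"], "beta_proteins": ["Cas9"]},
    {"alpha_proteins": ["Cas9"], "beta_proteins": ["P12345"]},
    {"alpha_proteins": ["P12345"], "beta_proteins": ["Q99999"]},
]
split = split_by_proteins(toy_links, ["Cas9"])
```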
- pyXLMS.transform.filter_target_decoy(data: List[Dict[str, Any]])
Separate crosslinks or crosslink-spectrum-matches based on target and decoy matches.
Separates crosslinks or crosslink-spectrum-matches based on whether both peptides match to the target database, both match to the decoy database, or one of them matches to the target database and the other to the decoy database. The first we denote as “Target-Target” or “TT” matches, the second as “Decoy-Decoy” or “DD” matches, and the third as “Target-Decoy” or “TD” matches.
- Parameters:
data (list of dict of str, any) – A list of pyXLMS crosslinks or crosslink-spectrum-matches.
- Returns:
Returns a dictionary with key Target-Target which contains all TT matches, key Target-Decoy which contains all TD matches, and key Decoy-Decoy which contains all DD matches.
- Return type:
dict
- Raises:
TypeError – If an unsupported data type is provided.
Examples
>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import filter_target_decoy
>>> result = read("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_CSMs.xlsx", engine="MS Annika", crosslinker="DSS")
>>> target_and_decoys = filter_target_decoy(result["crosslink-spectrum-matches"])
>>> len(target_and_decoys["Target-Target"])
786
>>> len(target_and_decoys["Target-Decoy"])
39
>>> len(target_and_decoys["Decoy-Decoy"])
1

>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import filter_target_decoy
>>> result = read("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_Crosslinks.xlsx", engine="MS Annika", crosslinker="DSS")
>>> target_and_decoys = filter_target_decoy(result["crosslinks"])
>>> len(target_and_decoys["Target-Target"])
265
>>> len(target_and_decoys["Target-Decoy"])
0
>>> len(target_and_decoys["Decoy-Decoy"])
35
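The TT/TD/DD classification follows directly from the two decoy flags. The sketch below assumes that alpha_decoy and beta_decoy are booleans and counts how many of them are set; split_target_decoy and the toy data are hypothetical names for illustration and do not show the library's implementation.

```python
from typing import Any, Dict, List


def split_target_decoy(data: List[Dict[str, Any]]) -> Dict[str, List[Dict[str, Any]]]:
    """Hypothetical helper: group matches by the number of decoy peptides."""
    labels = ("Target-Target", "Target-Decoy", "Decoy-Decoy")
    groups: Dict[str, List[Dict[str, Any]]] = {label: [] for label in labels}
    for item in data:
        n_decoys = int(bool(item["alpha_decoy"])) + int(bool(item["beta_decoy"]))
        groups[labels[n_decoys]].append(item)  # 0 -> TT, 1 -> TD, 2 -> DD
    return groups


toy_matches = [
    {"alpha_decoy": False, "beta_decoy": False},
    {"alpha_decoy": False, "beta_decoy": True},
    {"alpha_decoy": True, "beta_decoy": False},
    {"alpha_decoy": True, "beta_decoy": True},
]
groups = split_target_decoy(toy_matches)
```

Decoy-target and target-decoy pairs land in the same group, which matches the documented meaning of TD.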
- pyXLMS.transform.get_available_keys(data_list: List[Dict[str, Any]])
Checks which data is available from a list of crosslinks or crosslink-spectrum-matches.
Verifies which data fields have been set for all crosslinks or crosslink-spectrum-matches in the given list. Returns a dictionary structured the same as a crosslink or crosslink-spectrum-match, but with each value replaced by either True or False, depending on whether the field was set or not.
- Parameters:
data_list (list of dict of str, any) – A list of crosslinks or crosslink-spectrum-matches.
- Returns:
If a list of crosslinks was provided, a dictionary with the following keys will be returned, where the value of each key denotes if the data field is available for all crosslinks in data_list. Keys: data_type, completeness, alpha_peptide, alpha_peptide_crosslink_position, alpha_proteins, alpha_proteins_crosslink_positions, alpha_decoy, beta_peptide, beta_peptide_crosslink_position, beta_proteins, beta_proteins_crosslink_positions, beta_decoy, crosslink_type, score, and additional_information.
If a list of crosslink-spectrum-matches was provided, a dictionary with the following keys will be returned, where the value of each key denotes if the data field is available for all crosslink-spectrum-matches in data_list. Keys: data_type, completeness, alpha_peptide, alpha_modifications, alpha_peptide_crosslink_position, alpha_proteins, alpha_proteins_crosslink_positions, alpha_proteins_peptide_positions, alpha_score, alpha_decoy, beta_peptide, beta_modifications, beta_peptide_crosslink_position, beta_proteins, beta_proteins_crosslink_positions, beta_proteins_peptide_positions, beta_score, beta_decoy, crosslink_type, score, spectrum_file, scan_nr, retention_time, ion_mobility, and additional_information.
- Return type:
dict of str, bool
- Raises:
TypeError – If not all elements in data_list are of the same data type.
TypeError – If one or more elements in the list are of an unsupported data type.
Examples
>>> from pyXLMS.transform import get_available_keys
>>> from pyXLMS import data
>>> data_list = [data.create_crosslink_min("PEPK", 4, "PKEP", 2), data.create_crosslink_min("KPEP", 1, "PEKP", 3)]
>>> available_keys = get_available_keys(data_list)
>>> available_keys["alpha_peptide"]
True
>>> available_keys["score"]
False
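The "available for all" semantics can be illustrated as follows. The sketch assumes that an unset field is stored as None, which is an assumption made for this example and not something stated in this documentation; available_keys_sketch and the toy dictionaries are hypothetical.

```python
from typing import Any, Dict, List


def available_keys_sketch(data_list: List[Dict[str, Any]]) -> Dict[str, bool]:
    """Hypothetical helper: a key is available only if it is set in every element."""
    keys = data_list[0].keys()
    # Assumption: a field that was not set holds None.
    return {key: all(item.get(key) is not None for item in data_list) for key in keys}


toy_crosslinks = [
    {"alpha_peptide": "PEPK", "beta_peptide": "PKEP", "score": 12.5},
    {"alpha_peptide": "KPEP", "beta_peptide": "PEKP", "score": None},
]
availability = available_keys_sketch(toy_crosslinks)
```

A single element with a missing score is enough to report score as unavailable, which is the property that matters before calling functions that require scores for all data.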
- pyXLMS.transform.modifications_to_str(modifications: Dict[int, Tuple[str, float]] | None)
Returns the string representation of a modifications dictionary.
- Parameters:
modifications (dict of int, tuple, or None) – The modifications of a peptide given as a dictionary that maps peptide position (1-based) to modification given as a tuple of modification name and modification delta mass. N-terminal modifications should be denoted with position 0. C-terminal modifications should be denoted with position len(peptide) + 1.
- Returns:
The string representation of the modifications (or None if no modification was provided).
- Return type:
str, or None
Examples
>>> from pyXLMS.transform import modifications_to_str
>>> modifications_to_str({1: ("Oxidation", 15.994915), 5: ("Carbamidomethyl", 57.021464)})
'(1:[Oxidation|15.994915]);(5:[Carbamidomethyl|57.021464])'
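The output format in the example above can be reproduced with a few lines of plain Python. This sketch is inferred from that single documented output only: the helper name, the ordering by position, and the handling of an empty dictionary (an empty string here) are assumptions and may differ from the library.

```python
from typing import Dict, Optional, Tuple


def modifications_to_str_sketch(
    modifications: Optional[Dict[int, Tuple[str, float]]],
) -> Optional[str]:
    """Hypothetical re-implementation of the documented '(pos:[name|mass])' format."""
    if modifications is None:
        return None
    parts = []
    for position in sorted(modifications):  # ordering by position is an assumption
        name, mass = modifications[position]
        parts.append(f"({position}:[{name}|{mass}])")
    return ";".join(parts)


encoded = modifications_to_str_sketch(
    {5: ("Carbamidomethyl", 57.021464), 1: ("Oxidation", 15.994915)}
)
```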
- pyXLMS.transform.reannotate_positions(data: List[Dict[str, Any]] | Dict[str, Any], fasta: str | BinaryIO, title_to_accession: Callable[[str], str] | None = None)
Reannotates protein crosslink positions for a given fasta file.
Reannotates the protein crosslink and peptide positions of the given cross-linked peptide pairs based on the specified fasta file. Takes a list of crosslink-spectrum-matches or crosslinks, or a parser_result as input.
- Parameters:
data (list of dict of str, any, or dict of str, any) – A list of crosslink-spectrum-matches or crosslinks to annotate, or a parser_result.
fasta (str, or file stream) – The name/path of the fasta file containing protein sequences or a file-like object/stream.
title_to_accession (callable, or None, default = None) – A function that parses the protein accession from the fasta title/header. If None (default) the function fasta_title_to_accession is used.
- Returns:
If a list of crosslink-spectrum-matches or crosslinks was provided, a list of annotated crosslink-spectrum-matches or crosslinks is returned. If a parser_result was provided, an annotated parser_result will be returned.
- Return type:
list of dict of str, any, or dict of str, any
- Raises:
TypeError – If a wrong data type is provided.
Examples
>>> from pyXLMS.data import create_crosslink_min
>>> from pyXLMS.transform import reannotate_positions
>>> xls = [create_crosslink_min("ADANLDK", 7, "GNTDRHSIK", 9)]
>>> xls = reannotate_positions(xls, "data/_fasta/Cas9_plus10.fasta")
>>> xls[0]["alpha_proteins"]
['Cas9']
>>> xls[0]["alpha_proteins_crosslink_positions"]
[1293]
>>> xls[0]["beta_proteins"]
['Cas9']
>>> xls[0]["beta_proteins_crosslink_positions"]
[48]
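The arithmetic behind a protein crosslink position is simple once the peptide has been located in the protein sequence: with 1-based positions, it is the peptide start in the protein plus the crosslink position in the peptide minus one. The sketch below illustrates this on a made-up sequence; protein_crosslink_positions and the toy sequence are hypothetical, and details such as isoleucine/leucine handling or how the library searches the fasta file are not covered.

```python
from typing import List


def protein_crosslink_positions(
    peptide: str, crosslink_position: int, protein_sequence: str
) -> List[int]:
    """Hypothetical helper: 1-based protein positions for every peptide occurrence."""
    positions = []
    start = protein_sequence.find(peptide)  # 0-based index of the peptide start
    while start != -1:
        # (start + 1) is the 1-based peptide start; adding the 1-based
        # crosslink position minus one gives start + crosslink_position.
        positions.append(start + crosslink_position)
        start = protein_sequence.find(peptide, start + 1)
    return positions


toy_sequence = "MKAADANLDKRR"  # made-up sequence, not taken from any fasta file
toy_positions = protein_crosslink_positions("ADANLDK", 7, toy_sequence)
```

A peptide that occurs more than once yields several positions, which is consistent with the position fields being lists in the example above.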
- pyXLMS.transform.summary(data: List[Dict[str, Any]] | Dict[str, Any])
Extracts summary stats from a list of crosslinks or crosslink-spectrum-matches, or a parser_result.
Extracts summary statistics from a list of crosslinks or crosslink-spectrum-matches, or a parser_result. The statistics depend on the supplied data type. If a list of crosslinks is supplied, a dictionary with the following statistics and keys is returned:
Number of crosslinks
Number of unique crosslinks by peptide
Number of unique crosslinks by protein
Number of intra crosslinks
Number of inter crosslinks
Number of target-target crosslinks
Number of target-decoy crosslinks
Number of decoy-decoy crosslinks
Minimum crosslink score
Maximum crosslink score
If a list of crosslink-spectrum-matches is supplied, a dictionary with the following statistics and keys is returned:
Number of CSMs
Number of unique CSMs
Number of intra CSMs
Number of inter CSMs
Number of target-target CSMs
Number of target-decoy CSMs
Number of decoy-decoy CSMs
Minimum CSM score
Maximum CSM score
If a parser_result is supplied, the returned dictionary contains whichever of these statistics are available. A parser_result that only contains crosslinks will only yield a dictionary with crosslink statistics, and vice versa a parser_result that only contains crosslink-spectrum-matches will only yield a dictionary with crosslink-spectrum-match statistics. If the parser_result contains both, then both dictionaries will be merged and returned. Please note that in this case a single dictionary is returned that contains the keys for both crosslinks and crosslink-spectrum-matches.
Statistics are also printed to stdout.
- Parameters:
data (list of dict of str, any, or dict of str, any) – A list of crosslinks or crosslink-spectrum-matches, or a parser_result.
- Returns:
A dictionary with summary statistics.
- Return type:
dict of str, float
Examples
>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import summary
>>> pr = read("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_CSMs.xlsx", engine="MS Annika", crosslinker="DSS")
>>> csms = pr["crosslink-spectrum-matches"]
>>> stats = summary(csms)
Number of CSMs: 826.0
Number of unique CSMs: 826.0
Number of intra CSMs: 803.0
Number of inter CSMs: 23.0
Number of target-target CSMs: 786.0
Number of target-decoy CSMs: 39.0
Number of decoy-decoy CSMs: 1.0
Minimum CSM score: 1.11
Maximum CSM score: 452.99

>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import summary
>>> pr = read("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_Crosslinks.xlsx", engine="MS Annika", crosslinker="DSS")
>>> stats = summary(pr)
Number of crosslinks: 300.0
Number of unique crosslinks by peptide: 300.0
Number of unique crosslinks by protein: 298.0
Number of intra crosslinks: 279.0
Number of inter crosslinks: 21.0
Number of target-target crosslinks: 265.0
Number of target-decoy crosslinks: 0.0
Number of decoy-decoy crosslinks: 35.0
Minimum crosslink score: 1.11
Maximum crosslink score: 452.99
- pyXLMS.transform.targets_only(data: List[Dict[str, Any]] | Dict[str, Any])
Get target crosslinks or crosslink-spectrum-matches.
Get target crosslinks or crosslink-spectrum-matches from a list of target and decoy crosslinks or crosslink-spectrum-matches, or a parser_result. This effectively filters out any target-decoy and decoy-decoy matches and is essentially a convenience wrapper for transform.filter_target_decoy()["Target-Target"].
- Parameters:
data (dict of str, any, or list of dict of str, any) – A list of crosslink-spectrum-matches or crosslinks, or a parser_result.
- Returns:
If a list of crosslink-spectrum-matches or crosslinks was provided, a list of target crosslink-spectrum-matches or crosslinks is returned. If a parser_result was provided, a parser_result with target crosslink-spectrum-matches and/or target crosslinks will be returned.
- Return type:
list of dict of str, any, or dict of str, any
- Raises:
TypeError – If a wrong data type is provided.
RuntimeError – If no target crosslinks or crosslink-spectrum-matches were found.
Examples
>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import targets_only
>>> result = read("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_CSMs.xlsx", engine="MS Annika", crosslinker="DSS")
>>> targets = targets_only(result["crosslink-spectrum-matches"])
>>> len(targets)
786

>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import targets_only
>>> result = read("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_Crosslinks.xlsx", engine="MS Annika", crosslinker="DSS")
>>> targets = targets_only(result["crosslinks"])
>>> len(targets)
265

>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import targets_only
>>> result = read(["data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_CSMs.xlsx", "data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_Crosslinks.xlsx"], engine="MS Annika", crosslinker="DSS")
>>> result_targets = targets_only(result)
>>> len(result_targets["crosslink-spectrum-matches"])
786
>>> len(result_targets["crosslinks"])
265
- pyXLMS.transform.to_dataframe(data: List[Dict[str, Any]])
Returns a pandas DataFrame of the given crosslinks or crosslink-spectrum-matches.
- Parameters:
data (list) – A list of crosslinks or crosslink-spectrum-matches as created by data.create_crosslink() or data.create_csm().
- Returns:
The pandas DataFrame created from the list of input crosslinks or crosslink-spectrum-matches. A full specification of the returned DataFrame can be found in the docs.
- Return type:
pandas.DataFrame
- Raises:
TypeError – If the list does not contain crosslinks or crosslink-spectrum-matches.
ValueError – If the list does not contain any objects.
Examples
>>> from pyXLMS.transform import to_dataframe
>>> # assume that crosslinks is a list of crosslinks created by data.create_crosslink()
>>> crosslink_dataframe = to_dataframe(crosslinks)
>>> # assume csms is a list of crosslink-spectrum-matches created by data.create_csm()
>>> csm_dataframe = to_dataframe(csms)
- pyXLMS.transform.to_proforma(data: Dict[str, Any] | List[Dict[str, Any]], crosslinker: str | float | None = None)
Returns the Proforma string for a single crosslink or crosslink-spectrum-match, or for a list of crosslinks or crosslink-spectrum-matches.
- Parameters:
data (dict of str, any, or list of dict of str, any) – A pyXLMS crosslink object, e.g. see data.create_crosslink(). Or a pyXLMS crosslink-spectrum-match object, e.g. see data.create_csm(). Alternatively, a list of crosslinks or crosslink-spectrum-matches can also be provided.
crosslinker (str, or float, or None, default = None) – Optional name or mass of the crosslink reagent. If the name is given, it should be a valid name from XLMOD. If the crosslink modification is contained in the crosslink-spectrum-match object this parameter has no effect.
- Returns:
The Proforma string of the crosslink or crosslink-spectrum-match. If a list was provided a list containing all Proforma strings is returned.
- Return type:
str, or list of str
- Raises:
TypeError – If an unsupported data type is provided.
Notes
Modifications with unknown mass are skipped.
If no modifications are given, only the crosslink modification will be encoded in the Proforma.
If no modifications are given and no crosslinker is given, the unmodified peptide Proforma will be returned.
Examples
>>> from pyXLMS.data import create_crosslink_min
>>> from pyXLMS.transform import to_proforma
>>> xl = create_crosslink_min("PEPKTIDE", 4, "KPEPTIDE", 1)
>>> to_proforma(xl)
'KPEPTIDE//PEPKTIDE'

>>> from pyXLMS.data import create_crosslink_min
>>> from pyXLMS.transform import to_proforma
>>> xl = create_crosslink_min("PEPKTIDE", 4, "KPEPTIDE", 1)
>>> to_proforma(xl, crosslinker="Xlink:DSSO")
'K[Xlink:DSSO]PEPTIDE//PEPK[Xlink:DSSO]TIDE'

>>> from pyXLMS.data import create_csm_min
>>> from pyXLMS.transform import to_proforma
>>> csm = create_csm_min("PEPKTIDE", 4, "KPEPTIDE", 1, "RUN_1", 1)
>>> to_proforma(csm)
'KPEPTIDE//PEPKTIDE'

>>> from pyXLMS.data import create_csm_min
>>> from pyXLMS.transform import to_proforma
>>> csm = create_csm_min("PEPKTIDE", 4, "KPEPTIDE", 1, "RUN_1", 1)
>>> to_proforma(csm, crosslinker="Xlink:DSSO")
'K[Xlink:DSSO]PEPTIDE//PEPK[Xlink:DSSO]TIDE'

>>> from pyXLMS.data import create_csm_min
>>> from pyXLMS.transform import to_proforma
>>> csm = create_csm_min("PEPKTIDE", 4, "KPMEPTIDE", 1, "RUN_1", 1, modifications_b={3:("Oxidation", 15.994915)})
>>> to_proforma(csm, crosslinker="Xlink:DSSO")
'K[Xlink:DSSO]PM[+15.994915]EPTIDE//PEPK[Xlink:DSSO]TIDE'

>>> from pyXLMS.data import create_csm_min
>>> from pyXLMS.transform import to_proforma
>>> csm = create_csm_min("PEPKTIDE", 4, "KPMEPTIDE", 1, "RUN_1", 1, modifications_b={3:("Oxidation", 15.994915)}, charge=3)
>>> to_proforma(csm, crosslinker="Xlink:DSSO")
'K[Xlink:DSSO]PM[+15.994915]EPTIDE//PEPK[Xlink:DSSO]TIDE/3'

>>> from pyXLMS.data import create_csm_min
>>> from pyXLMS.transform import to_proforma
>>> csm = create_csm_min("PEPKTIDE", 4, "KPMEPTIDE", 1, "RUN_1", 1, modifications_a={4:("DSSO", 158.00376)}, modifications_b={1:("DSSO", 158.00376), 3:("Oxidation", 15.994915)}, charge=3)
>>> to_proforma(csm)
'K[+158.00376]PM[+15.994915]EPTIDE//PEPK[+158.00376]TIDE/3'

>>> from pyXLMS.data import create_csm_min
>>> from pyXLMS.transform import to_proforma
>>> csm = create_csm_min("PEPKTIDE", 4, "KPMEPTIDE", 1, "RUN_1", 1, modifications_a={4:("DSSO", 158.00376)}, modifications_b={1:("DSSO", 158.00376), 3:("Oxidation", 15.994915)}, charge=3)
>>> to_proforma(csm, crosslinker="Xlink:DSSO")
'K[+158.00376]PM[+15.994915]EPTIDE//PEPK[+158.00376]TIDE/3'
- pyXLMS.transform.unique(data: List[Dict[str, Any]] | Dict[str, Any], by: Literal['peptide', 'protein'] = 'peptide', score: Literal['higher_better', 'lower_better'] = 'higher_better')
Filter for unique crosslinks or crosslink-spectrum-matches.
Filters for unique crosslinks from a list of non-unique crosslinks. A crosslink is considered unique if there is no other crosslink with the same peptide sequence and crosslink position if by = "peptide", otherwise it is considered unique if there are no other crosslinks with the same protein crosslink position (residue pair). If more than one crosslink exists per peptide sequence/residue pair, the one with the better/best score is kept and the rest are filtered out. If crosslinks without scores are provided, the first crosslink in the list is kept instead.
or
Filters for unique crosslink-spectrum-matches from a list of non-unique crosslink-spectrum-matches. A crosslink-spectrum-match is considered unique if there is no other crosslink-spectrum-match from the same spectrum file and with the same scan number. If more than one crosslink-spectrum-match exists per spectrum file and scan number, the one with the better/best score is kept and the rest are filtered out. If crosslink-spectrum-matches without scores are provided, the first crosslink-spectrum-match in the list is kept instead.
- Parameters:
data (dict of str, any, or list of dict of str, any) – A list of crosslink-spectrum-matches or crosslinks to filter, or a parser_result.
by (str, one of "peptide" or "protein", default = "peptide") – If peptide or protein crosslink position should be used for determining if a crosslink is unique. Only affects filtering for unique crosslinks and not crosslink-spectrum-matches. If protein crosslink position is not available for all crosslinks a ValueError will be raised. Make sure that all crosslinks have the _proteins and _proteins_crosslink_positions fields set. If this is not already done by the parser, this can be achieved with transform.reannotate_positions().
score (str, one of "higher_better" or "lower_better", default = "higher_better") – If a higher score is considered better, or a lower score is considered better.
- Returns:
If a list of crosslink-spectrum-matches or crosslinks was provided, a list of unique crosslink-spectrum-matches or crosslinks is returned. If a parser_result was provided, a parser_result with unique crosslink-spectrum-matches and/or unique crosslinks will be returned.
- Return type:
list of dict of str, any, or dict of str, any
- Raises:
TypeError – If a wrong data type is provided.
TypeError – If parameter by is not one of ‘peptide’ or ‘protein’.
TypeError – If parameter score is not one of ‘higher_better’ or ‘lower_better’.
ValueError – If parameter by is set to ‘protein’ but protein crosslink positions are not available.
Examples
>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import unique
>>> pr = read(["data/_test/aggregate/csms.txt", "data/_test/aggregate/xls.txt"], engine="custom", crosslinker="DSS")
>>> len(pr["crosslink-spectrum-matches"])
10
>>> len(pr["crosslinks"])
10
>>> unique_peptide = unique(pr, by="peptide")
>>> len(unique_peptide["crosslink-spectrum-matches"])
5
>>> len(unique_peptide["crosslinks"])
3

>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import unique
>>> pr = read(["data/_test/aggregate/csms.txt", "data/_test/aggregate/xls.txt"], engine="custom", crosslinker="DSS")
>>> len(pr["crosslink-spectrum-matches"])
10
>>> len(pr["crosslinks"])
10
>>> unique_protein = unique(pr, by="protein")
>>> len(unique_protein["crosslink-spectrum-matches"])
5
>>> len(unique_protein["crosslinks"])
2
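For crosslink-spectrum-matches the uniqueness key is the spectrum file together with the scan number, independent of the by parameter. A compact way to picture this is to rank by score first and then keep the first match seen per key. The sketch assumes that every match has a score; unique_csms_sketch and the toy data are hypothetical, and the result is ordered by score rather than by input order, which may differ from the library.

```python
from typing import Any, Dict, List, Tuple


def unique_csms_sketch(
    csms: List[Dict[str, Any]], higher_better: bool = True
) -> List[Dict[str, Any]]:
    """Hypothetical helper: best-scoring match per (spectrum_file, scan_nr)."""
    ranked = sorted(csms, key=lambda csm: csm["score"], reverse=higher_better)
    kept: Dict[Tuple[str, int], Dict[str, Any]] = {}
    for csm in ranked:
        # setdefault stores only the first (best-ranked) match for each spectrum
        kept.setdefault((csm["spectrum_file"], csm["scan_nr"]), csm)
    return list(kept.values())


toy_spectra = [
    {"spectrum_file": "RUN_1", "scan_nr": 1, "score": 10.0},
    {"spectrum_file": "RUN_1", "scan_nr": 1, "score": 30.0},
    {"spectrum_file": "RUN_1", "scan_nr": 2, "score": 20.0},
    {"spectrum_file": "RUN_2", "scan_nr": 1, "score": 5.0},
]
unique_spectra = unique_csms_sketch(toy_spectra)
```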
- pyXLMS.transform.validate(data: List[Dict[str, Any]] | Dict[str, Any], fdr: float = 0.01, formula: Literal['D/T', '(TD+DD)/TT', '(TD-DD)/TT'] = 'D/T', score: Literal['higher_better', 'lower_better'] = 'higher_better', separate_intra_inter: bool = False, ignore_missing_labels: bool = False)
Validate a list of crosslinks or crosslink-spectrum-matches, or a parser_result by estimating false discovery rate.
Validate a list of crosslinks or crosslink-spectrum-matches, or a parser_result by estimating false discovery rate (FDR) using the defined formula. Requires that “score”, “alpha_decoy” and “beta_decoy” fields are set for crosslinks and crosslink-spectrum-matches.
- Parameters:
data (list of dict of str, any, or dict of str, any) – A list of crosslink-spectrum-matches or crosslinks to validate, or a parser_result.
fdr (float, default = 0.01) – The target FDR, must be given as a real number between 0 and 1. The default of 0.01 corresponds to 1% FDR.
formula (str, one of "D/T", "(TD+DD)/TT", or "(TD-DD)/TT", default = "D/T") – Which formula to use to estimate FDR. D and DD denote decoy matches, T and TT denote target matches, and TD denotes target-decoy and decoy-target matches.
score (str, one of "higher_better" or "lower_better", default = "higher_better") – If a higher score is considered better, or a lower score is considered better.
separate_intra_inter (bool, default = False) – If FDR should be estimated separately for intra and inter matches.
ignore_missing_labels (bool, default = False) – Whether crosslinks and crosslink-spectrum-matches without target and decoy labels should be ignored. By default an error is raised if any unlabelled data is encountered.
- Returns:
If a list of crosslink-spectrum-matches or crosslinks was provided, a list of validated crosslink-spectrum-matches or crosslinks is returned. If a parser_result was provided, a parser_result with validated crosslink-spectrum-matches and/or validated crosslinks will be returned.
- Return type:
list of dict of str, any, or dict of str, any
- Raises:
TypeError – If a wrong data type is provided.
TypeError – If parameter formula is not one of ‘D/T’, ‘(TD+DD)/TT’, or ‘(TD-DD)/TT’.
TypeError – If parameter score is not one of ‘higher_better’ or ‘lower_better’.
ValueError – If parameter fdr is outside of the supported range.
ValueError – If attribute ‘score’ is not available for any of the data.
ValueError – If attribute ‘alpha_decoy’ or ‘beta_decoy’ is not available for any of the data and parameter ignore_missing_labels is set to False.
ValueError – If the number of DD matches exceeds the number of TD matches for formula ‘(TD-DD)/TT’. FDR cannot be estimated with the formula ‘(TD-DD)/TT’ in these cases.
Notes
Please note that progress bars will usually not complete when running this function. This is by design as it is not necessary to iterate over all scores to estimate FDR.
Examples
>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import validate
>>> pr = read("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_CSMs.xlsx", engine="MS Annika", crosslinker="DSS")
>>> csms = pr["crosslink-spectrum-matches"]
>>> len(csms)
826
>>> validated = validate(csms)
>>> len(validated)
705

>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import validate
>>> pr = read(["data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_CSMs.xlsx", "data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_Crosslinks.xlsx"], engine="MS Annika", crosslinker="DSS")
>>> len(pr["crosslink-spectrum-matches"])
826
>>> len(pr["crosslinks"])
300
>>> validated = validate(pr)
>>> len(validated["crosslink-spectrum-matches"])
705
>>> len(validated["crosslinks"])
226

>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import validate
>>> pr = read(["data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_CSMs.xlsx", "data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_Crosslinks.xlsx"], engine="MS Annika", crosslinker="DSS")
>>> len(pr["crosslink-spectrum-matches"])
826
>>> len(pr["crosslinks"])
300
>>> validated = validate(pr, fdr=0.05)
>>> len(validated["crosslink-spectrum-matches"])
825
>>> len(validated["crosslinks"])
260
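The general idea of target-decoy FDR estimation can be sketched independently of the library. The code below ranks matches by score (higher is better), tracks running TT, TD and DD counts, and keeps the longest ranked prefix whose estimated FDR under the ‘(TD+DD)/TT’ formula does not exceed the target. It is a simplified illustration under stated assumptions, not the algorithm used by validate(): fdr_cutoff_sketch and the toy data are hypothetical, tied scores are not treated specially, and whether decoys remain in the library's output is not specified here.

```python
from typing import Any, Dict, List


def fdr_cutoff_sketch(data: List[Dict[str, Any]], fdr: float = 0.01) -> List[Dict[str, Any]]:
    """Hypothetical helper: longest ranked prefix with (TD + DD) / TT <= fdr."""
    ranked = sorted(data, key=lambda match: match["score"], reverse=True)
    tt = td = dd = 0
    keep = 0  # number of ranked matches that pass the FDR threshold
    for index, match in enumerate(ranked, start=1):
        n_decoys = int(bool(match["alpha_decoy"])) + int(bool(match["beta_decoy"]))
        if n_decoys == 0:
            tt += 1
        elif n_decoys == 1:
            td += 1
        else:
            dd += 1
        if tt > 0 and (td + dd) / tt <= fdr:
            keep = index
    return ranked[:keep]


def toy_match(score: float, alpha_decoy: bool = False, beta_decoy: bool = False) -> Dict[str, Any]:
    """Build a minimal toy dictionary with only the fields the sketch needs."""
    return {"score": score, "alpha_decoy": alpha_decoy, "beta_decoy": beta_decoy}


toy_scored = [
    toy_match(10.0),
    toy_match(9.0),
    toy_match(8.0),
    toy_match(7.0, beta_decoy=True),
    toy_match(6.0),
]
strict = fdr_cutoff_sketch(toy_scored, fdr=0.2)
relaxed = fdr_cutoff_sketch(toy_scored, fdr=0.3)
```

In the toy data the single target-decoy match at score 7.0 gives an estimate of 1/3 at that rank and 1/4 one rank later, so a 20% target keeps three matches while a 30% target keeps all five.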