pyXLMS.parser package

Submodules

pyXLMS.parser.parser_xldbse_custom module

pyXLMS.parser.parser_xldbse_custom.pyxlms_modification_str_parser(
modifications: str,
) -> Dict[int, Tuple[str, float]]

Parse a pyXLMS modification string.

Parses a pyXLMS modification string and returns the pyXLMS-specific modification object, a dictionary that maps positions to their modifications.

Parameters:

modifications (str) – The pyXLMS modification string.

Returns:

The pyXLMS specific modification object, a dictionary that maps positions (1-based) to their respective modifications given as tuples of modification name and modification delta mass.

Return type:

dict of int, tuple

Raises:

RuntimeError – If multiple modifications on the same residue are parsed.

Examples

>>> from pyXLMS.parser import pyxlms_modification_str_parser
>>> modification_str = "(1:[DSS|138.06808])"
>>> pyxlms_modification_str_parser(modification_str)
{1: ('DSS', 138.06808)}
>>> from pyXLMS.parser import pyxlms_modification_str_parser
>>> modification_str = "(1:[DSS|138.06808]);(7:[Oxidation|15.994915])"
>>> pyxlms_modification_str_parser(modification_str)
{1: ('DSS', 138.06808), 7: ('Oxidation', 15.994915)}
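
The string format shown in the examples above can be illustrated with a small regular-expression sketch. This is not the library's implementation: the helper name `parse_modification_str_sketch` and the pattern are assumptions inferred only from the two example strings, and raising on a repeated position is one plausible reading of the documented RuntimeError.

```python
import re
from typing import Dict, Tuple

# Assumed pattern for one entry such as "(7:[Oxidation|15.994915])".
_ENTRY = re.compile(r"\((\d+):\[([^|\]]+)\|([^\]]+)\]\)")


def parse_modification_str_sketch(modifications: str) -> Dict[int, Tuple[str, float]]:
    """Hypothetical re-implementation, for illustration only."""
    parsed: Dict[int, Tuple[str, float]] = {}
    for position, name, mass in _ENTRY.findall(modifications):
        pos = int(position)
        if pos in parsed:
            # Assumption: a repeated position counts as a conflict.
            raise RuntimeError(f"Multiple modifications at position {pos}.")
        parsed[pos] = (name, float(mass))
    return parsed


print(parse_modification_str_sketch("(1:[DSS|138.06808]);(7:[Oxidation|15.994915])"))
# {1: ('DSS', 138.06808), 7: ('Oxidation', 15.994915)}
```

A function with this call signature (str in, dict of int to tuple out) is the shape that the modification_parser parameter of read_custom() expects.
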
pyXLMS.parser.parser_xldbse_custom.read_custom(
files: str | List[str] | BinaryIO,
column_mapping: Dict[str, str] | None = None,
parse_modifications: bool = True,
modification_parser: Callable[[str], Dict[int, Tuple[str, float]]] | None = None,
decoy_prefix: str = 'REV_',
format: Literal['auto', 'csv', 'txt', 'tsv', 'xlsx'] = 'auto',
sep: str = ',',
decimal: str = '.',
) -> Dict[str, Any]

Read a custom or pyXLMS result file.

Reads a custom or pyXLMS crosslink-spectrum-matches result file or crosslink result file in .csv, .tsv, .txt, or .xlsx format and returns a parser_result.

The minimum required columns for a crosslink-spectrum-matches result file are:

  • “Alpha Peptide”: The unmodified amino acid sequence of the first peptide.

  • “Alpha Peptide Crosslink Position”: The position of the crosslinker in the sequence of the first peptide (1-based).

  • “Beta Peptide”: The unmodified amino acid sequence of the second peptide.

  • “Beta Peptide Crosslink Position”: The position of the crosslinker in the sequence of the second peptide (1-based).

  • “Spectrum File”: Name of the spectrum file the crosslink-spectrum-match was identified in.

  • “Scan Nr”: The corresponding scan number of the crosslink-spectrum-match.

The minimum required columns for a crosslink result file are:

  • “Alpha Peptide”: The unmodified amino acid sequence of the first peptide.

  • “Alpha Peptide Crosslink Position”: The position of the crosslinker in the sequence of the first peptide (1-based).

  • “Beta Peptide”: The unmodified amino acid sequence of the second peptide.

  • “Beta Peptide Crosslink Position”: The position of the crosslinker in the sequence of the second peptide (1-based).

A full specification of columns that can be parsed can be found in the docs.

Parameters:
  • files (str, list of str, or file stream) – The name/path of the result file(s) or a file-like object/stream.

  • column_mapping (dict of str, str) – A dictionary that maps the result file columns to the required pyXLMS column names.

  • parse_modifications (bool, default = True) – Whether or not post-translational-modifications should be parsed for crosslink-spectrum-matches. Requires correct specification of the ‘modification_parser’ parameter.

  • modification_parser (callable, or None) – A function that parses modification strings and returns the pyXLMS specific modifications object. If None, the function pyxlms_modification_str_parser() is used. If no modification columns are given this parameter is ignored.

  • decoy_prefix (str, default = "REV_") – The prefix that indicates that a protein is from the decoy database.

  • format ("auto", "csv", "tsv", "txt", or "xlsx", default = "auto") – The format of the result file. "auto" is only available if the name/path to the result file is given.

  • sep (str, default = ",") – Separator used in the .csv or .tsv file. Parameter is ignored if the file is in .xlsx format.

  • decimal (str, default = ".") – Character to recognize as decimal point. Parameter is ignored if the file is in .xlsx format.

Returns:

The parser_result object containing all parsed information.

Return type:

dict

Raises:
  • ValueError – If the input format is not supported or cannot be inferred.

  • TypeError – If one of the values could not be parsed.

  • RuntimeError – If the file(s) could not be read or if the file(s) contain no crosslinks or crosslink-spectrum-matches.

Examples

>>> from pyXLMS.parser import read_custom
>>> csms_from_pyxlms = read_custom("data/pyxlms/csm.txt")
>>> from pyXLMS.parser import read_custom
>>> crosslinks_from_pyxlms = read_custom("data/pyxlms/xl.txt")
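
The minimum column requirements and the role of column_mapping can be sketched without the library. The snippet below only uses the standard library; the source column names (`pep_a`, `scan`, ...), the helper `missing_columns`, and the single data row are made up for illustration and say nothing about how read_custom() validates its input internally.

```python
import csv
import io

# Required column names as listed in the description of read_custom().
REQUIRED_XL_COLUMNS = [
    "Alpha Peptide",
    "Alpha Peptide Crosslink Position",
    "Beta Peptide",
    "Beta Peptide Crosslink Position",
]
REQUIRED_CSM_COLUMNS = REQUIRED_XL_COLUMNS + ["Spectrum File", "Scan Nr"]

# Hypothetical column names of a third-party file, mapped to pyXLMS names.
column_mapping = {
    "pep_a": "Alpha Peptide",
    "pos_a": "Alpha Peptide Crosslink Position",
    "pep_b": "Beta Peptide",
    "pos_b": "Beta Peptide Crosslink Position",
    "raw_file": "Spectrum File",
    "scan": "Scan Nr",
}


def missing_columns(header, required):
    """Return the required columns that are absent from the header."""
    return [column for column in required if column not in header]


# Write one made-up crosslink-spectrum-match to an in-memory CSV file.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(list(column_mapping))
writer.writerow(["KIECFDSVEISGVEDR", 1, "VVDELVKVMGR", 7, "example_run.raw", 5321])

# Read the header back and rename it the way a column mapping would.
buffer.seek(0)
raw_header = next(csv.reader(buffer))
renamed_header = [column_mapping.get(name, name) for name in raw_header]

print(missing_columns(renamed_header, REQUIRED_CSM_COLUMNS))  # []
```

With a real file, a dictionary of this shape would presumably be passed as the column_mapping argument so that the reader finds the required names.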

pyXLMS.parser.parser_xldbse_maxquant module

pyXLMS.parser.parser_xldbse_maxquant.parse_modifications_from_maxquant_sequence(
seq: str,
crosslink_position: int,
crosslinker: str,
crosslinker_mass: float,
modifications: Dict[str, float] = {'ADH': 138.09054635, 'Acetyl': 42.010565, 'BS3': 138.06808, 'Carbamidomethyl': 57.021464, 'DSBSO': 308.03883, 'DSBU': 196.08479231, 'DSS': 138.06808, 'DSSO': 158.00376, 'Oxidation': 15.994915, 'PhoX': 209.97181, 'Phospho': 79.966331},
) -> Dict[int, Tuple[str, float]]

Parse post-translational-modifications from a MaxQuant peptide sequence.

Parses post-translational-modifications (PTMs) from a MaxQuant peptide sequence, for example “_VVDELVKVM(Oxidation (M))GR_”.

Parameters:
  • seq (str) – The MaxQuant sequence string.

  • crosslink_position (int) – Position of the crosslinker in the sequence (1-based).

  • crosslinker (str) – Name of the used cross-linking reagent, for example “DSSO”.

  • crosslinker_mass (float) – Monoisotopic delta mass of the crosslink modification.

  • modifications (dict of str, float, default = constants.MODIFICATIONS) – Mapping of modification names to modification masses.

Returns:

The pyXLMS specific modifications object, a dictionary that maps positions to their corresponding modifications and their monoisotopic masses.

Return type:

dict of int, tuple

Raises:
  • RuntimeError – If the sequence could not be parsed because it is not in MaxQuant format.

  • RuntimeError – If multiple modifications on the same residue are parsed.

  • KeyError – If an unknown modification is encountered.

Examples

>>> from pyXLMS.parser import parse_modifications_from_maxquant_sequence
>>> seq = "_VVDELVKVM(Oxidation (M))GR_"
>>> parse_modifications_from_maxquant_sequence(seq, 2, "DSS", 138.06808)
{2: ('DSS', 138.06808), 9: ('Oxidation', 15.994915)}
>>> from pyXLMS.parser import parse_modifications_from_maxquant_sequence
>>> seq = "_VVDELVKVM(Oxidation (M))GRM(Oxidation (M))_"
>>> parse_modifications_from_maxquant_sequence(seq, 2, "DSS", 138.06808)
{2: ('DSS', 138.06808), 9: ('Oxidation', 15.994915), 12: ('Oxidation', 15.994915)}
>>> from pyXLMS.parser import parse_modifications_from_maxquant_sequence
>>> seq = "_M(Oxidation (M))VVDELVKVM(Oxidation (M))GRM(Oxidation (M))_"
>>> parse_modifications_from_maxquant_sequence(seq, 2, "DSS", 138.06808)
{2: ('DSS', 138.06808), 1: ('Oxidation', 15.994915), 10: ('Oxidation', 15.994915), 13: ('Oxidation', 15.994915)}
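
How positions are counted in such a sequence can be sketched as follows. This is an illustrative re-implementation under assumptions, not the library's code: the helper name, the small `KNOWN_MASSES` table (values copied from the default shown in the signature), and the rule that the modification name is the text before the first " (" are all inferred from the examples above. Terminal modifications are not treated specially.

```python
from typing import Dict, Tuple

# Subset of the default mass table shown in the signature above.
KNOWN_MASSES = {"Oxidation": 15.994915, "Carbamidomethyl": 57.021464, "Acetyl": 42.010565}


def parse_maxquant_sequence_sketch(
    seq: str,
    crosslink_position: int,
    crosslinker: str,
    crosslinker_mass: float,
    masses: Dict[str, float] = KNOWN_MASSES,
) -> Dict[int, Tuple[str, float]]:
    """Hypothetical sketch: walk the sequence and track parenthesis depth."""
    if not (seq.startswith("_") and seq.endswith("_")):
        raise RuntimeError("Sequence is not in MaxQuant format.")
    parsed = {crosslink_position: (crosslinker, crosslinker_mass)}
    residues_seen = 0  # 1-based position of the last amino acid read
    depth = 0          # nesting depth, "(Oxidation (M))" nests twice
    annotation = ""
    for char in seq.strip("_"):
        if char == "(":
            if depth > 0:
                annotation += char
            depth += 1
        elif char == ")":
            depth -= 1
            if depth > 0:
                annotation += char
            else:
                name = annotation.split(" (")[0]  # "Oxidation (M)" -> "Oxidation"
                if residues_seen in parsed:
                    raise RuntimeError(f"Multiple modifications at position {residues_seen}.")
                parsed[residues_seen] = (name, masses[name])  # KeyError if unknown
                annotation = ""
        elif depth > 0:
            annotation += char
        else:
            residues_seen += 1
    return parsed


print(parse_maxquant_sequence_sketch("_VVDELVKVM(Oxidation (M))GR_", 2, "DSS", 138.06808))
# {2: ('DSS', 138.06808), 9: ('Oxidation', 15.994915)}
```
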
pyXLMS.parser.parser_xldbse_maxquant.read_maxlynx(
files: str | List[str] | BinaryIO,
crosslinker: str,
crosslinker_mass: float | None = None,
decoy_prefix: str = 'REV__',
parse_modifications: bool = True,
modifications: Dict[str, float] = {'ADH': 138.09054635, 'Acetyl': 42.010565, 'BS3': 138.06808, 'Carbamidomethyl': 57.021464, 'DSBSO': 308.03883, 'DSBU': 196.08479231, 'DSS': 138.06808, 'DSSO': 158.00376, 'Oxidation': 15.994915, 'PhoX': 209.97181, 'Phospho': 79.966331},
sep: str = '\t',
decimal: str = '.',
) -> Dict[str, Any]

Read a MaxLynx result file.

Reads a MaxLynx crosslink-spectrum-matches result file “crosslinkMsms.txt” in .txt (tab delimited) format and returns a parser_result. This is an alias for the MaxQuant reader.

Parameters:
  • files (str, list of str, or file stream) – The name/path of the MaxLynx result file(s) or a file-like object/stream.

  • crosslinker (str) – Name of the used cross-linking reagent, for example “DSSO”.

  • crosslinker_mass (float, or None, default = None) – Monoisotopic delta mass of the crosslink modification. If the crosslinker is defined in parameter “modifications” this can be omitted.

  • decoy_prefix (str, default = "REV__") – The prefix that indicates that a protein is from the decoy database.

  • parse_modifications (bool, default = True) – Whether or not post-translational-modifications should be parsed for crosslink-spectrum-matches. Requires correct specification of the ‘modifications’ parameter.

  • modifications (dict of str, float, default = constants.MODIFICATIONS) – Mapping of modification names to modification masses.

  • sep (str, default = "\t") – Separator used in the .txt file.

  • decimal (str, default = ".") – Character to recognize as decimal point.

Returns:

The parser_result object containing all parsed information.

Return type:

dict

Raises:
  • RuntimeError – If the file(s) could not be read or if the file(s) contain no crosslink-spectrum-matches.

  • KeyError – If the specified crosslinker could not be found/mapped.

Warning

MaxLynx/MaxQuant only reports a single protein crosslink position per peptide; for ambiguous peptides, only the crosslink position of the first matching protein is reported. All matching proteins can be retrieved via additional_information, but not their corresponding crosslink positions. For this reason, it is recommended to use transform.reannotate_positions() to correctly annotate all crosslink positions for all peptides if that is important for downstream analysis.

Examples

>>> from pyXLMS.parser import read_maxlynx
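>>> # "crosslinker" is a required argument; "DSS" is an assumed placeholder, use the reagent of your experiment
>>> csms = read_maxlynx("data/maxquant/run1/crosslinkMsms.txt", crosslinker="DSS")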
pyXLMS.parser.parser_xldbse_maxquant.read_maxquant(
files: str | List[str] | BinaryIO,
crosslinker: str,
crosslinker_mass: float | None = None,
decoy_prefix: str = 'REV__',
parse_modifications: bool = True,
modifications: Dict[str, float] = {'ADH': 138.09054635, 'Acetyl': 42.010565, 'BS3': 138.06808, 'Carbamidomethyl': 57.021464, 'DSBSO': 308.03883, 'DSBU': 196.08479231, 'DSS': 138.06808, 'DSSO': 158.00376, 'Oxidation': 15.994915, 'PhoX': 209.97181, 'Phospho': 79.966331},
sep: str = '\t',
decimal: str = '.',
) -> Dict[str, Any]

Read a MaxQuant result file.

Reads a MaxQuant crosslink-spectrum-matches result file “crosslinkMsms.txt” in .txt (tab delimited) format and returns a parser_result.

Parameters:
  • files (str, list of str, or file stream) – The name/path of the MaxQuant result file(s) or a file-like object/stream.

  • crosslinker (str) – Name of the used cross-linking reagent, for example “DSSO”.

  • crosslinker_mass (float, or None, default = None) – Monoisotopic delta mass of the crosslink modification. If the crosslinker is defined in parameter “modifications” this can be omitted.

  • decoy_prefix (str, default = "REV__") – The prefix that indicates that a protein is from the decoy database.

  • parse_modifications (bool, default = True) – Whether or not post-translational-modifications should be parsed for crosslink-spectrum-matches. Requires correct specification of the ‘modifications’ parameter.

  • modifications (dict of str, float, default = constants.MODIFICATIONS) – Mapping of modification names to modification masses.

  • sep (str, default = "\t") – Separator used in the .txt file.

  • decimal (str, default = ".") – Character to recognize as decimal point.

Returns:

The parser_result object containing all parsed information.

Return type:

dict

Raises:
  • RuntimeError – If the file(s) could not be read or if the file(s) contain no crosslink-spectrum-matches.

  • KeyError – If the specified crosslinker could not be found/mapped.

Warning

MaxLynx/MaxQuant only reports a single protein crosslink position per peptide; for ambiguous peptides, only the crosslink position of the first matching protein is reported. All matching proteins can be retrieved via additional_information, but not their corresponding crosslink positions. For this reason, it is recommended to use transform.reannotate_positions() to correctly annotate all crosslink positions for all peptides if that is important for downstream analysis.

Examples

>>> from pyXLMS.parser import read_maxquant
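>>> # "crosslinker" is a required argument; "DSS" is an assumed placeholder, use the reagent of your experiment
>>> csms = read_maxquant("data/maxquant/run1/crosslinkMsms.txt", crosslinker="DSS")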

pyXLMS.parser.parser_xldbse_merox module

pyXLMS.parser.parser_xldbse_merox.read_merox(
files: str | List[str] | BinaryIO,
crosslinker: str,
crosslinker_mass: float | None = None,
decoy_prefix: str = 'REV__',
parse_modifications: bool = True,
modifications: Dict[str, Dict[str, Any]] = {'B': {'Amino Acid': 'C', 'Modification': ('Carbamidomethyl', 57.021464)}, 'm': {'Amino Acid': 'M', 'Modification': ('Oxidation', 15.994915)}},
sep: str = ';',
decimal: str = '.',
) -> Dict[str, Any]

Read a MeroX result file.

Reads a MeroX crosslink-spectrum-matches result file in .csv or .zhrm format and returns a parser_result.

Parameters:
  • files (str, list of str, or file stream) – The name/path of the MeroX result file(s) or a file-like object/stream.

  • crosslinker (str) – Name of the used cross-linking reagent, for example “DSSO”.

  • crosslinker_mass (float, or None, default = None) – Monoisotopic delta mass of the crosslink modification. If the crosslinker is defined in constants.MODIFICATIONS this can be omitted.

  • decoy_prefix (str, default = "REV__") – The prefix that indicates that a protein is from the decoy database.

  • parse_modifications (bool, default = True) – Whether or not post-translational-modifications should be parsed for crosslink-spectrum-matches. Requires correct specification of the ‘modifications’ parameter.

  • modifications (dict of str, dict of str, any, default = constants.MEROX_MODIFICATION_MAPPING) – Mapping of modification symbols to their amino acids and modifications. Please refer to constants.MEROX_MODIFICATION_MAPPING for examples.

  • sep (str, default = ";") – Separator used in the .csv or .zhrm file.

  • decimal (str, default = ".") – Character to recognize as decimal point.

Returns:

The parser_result object containing all parsed information.

Return type:

dict

Raises:
  • RuntimeError – If the file(s) could not be read or if the file(s) contain no crosslink-spectrum-matches.

  • KeyError – If the specified crosslinker could not be found/mapped.

Warning

MeroX only reports a single protein crosslink position per peptide; for ambiguous peptides, only the crosslink position of the first matching protein is reported. All matching proteins can be retrieved via additional_information, but not their corresponding crosslink positions. For this reason, it is recommended to use transform.reannotate_positions() to correctly annotate all crosslink positions for all peptides if that is important for downstream analysis. Additionally, please note that target and decoy information is derived from the protein accession and the decoy_prefix parameter. By default, MeroX only reports target matches that are above the desired FDR.

Examples

>>> from pyXLMS.parser import read_merox
>>> csms_from_csv = read_merox("data/merox/XLpeplib_Beveridge_QEx-HFX_DSS_R1.csv", crosslinker="DSS")
>>> from pyXLMS.parser import read_merox
>>> csms_from_zhrm = read_merox("data/merox/XLpeplib_Beveridge_QEx-HFX_DSS_R1.zhrm", crosslinker="DSS")
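
The warning above says that target and decoy status is derived from the protein accession and decoy_prefix. A minimal sketch of that idea, assuming a plain prefix check, could look like this. The function names and the accessions are hypothetical, and the reader's actual logic may differ.

```python
def is_decoy(accession: str, decoy_prefix: str = "REV__") -> bool:
    """Assumed rule: a protein is a decoy if its accession starts with the prefix."""
    return accession.startswith(decoy_prefix)


def match_type(alpha_accession: str, beta_accession: str, decoy_prefix: str = "REV__") -> str:
    """Label a peptide pair as target-target, target-decoy, or decoy-decoy."""
    decoys = sum(is_decoy(a, decoy_prefix) for a in (alpha_accession, beta_accession))
    return ("TT", "TD", "DD")[decoys]


# Made-up accessions for illustration.
print(match_type("P12345", "Q67890"))            # TT
print(match_type("REV__P12345", "Q67890"))       # TD
print(match_type("REV__P12345", "REV__Q67890"))  # DD
```

If the search used a different decoy naming scheme, the prefix passed to the reader has to match it, otherwise every match would be labelled as a target.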

pyXLMS.parser.parser_xldbse_msannika module

pyXLMS.parser.parser_xldbse_msannika.read_msannika(
files: str | List[str] | BinaryIO,
parse_modifications: bool = True,
modifications: Dict[str, float] = {'ADH': 138.09054635, 'Acetyl': 42.010565, 'BS3': 138.06808, 'Carbamidomethyl': 57.021464, 'DSBSO': 308.03883, 'DSBU': 196.08479231, 'DSS': 138.06808, 'DSSO': 158.00376, 'Oxidation': 15.994915, 'PhoX': 209.97181, 'Phospho': 79.966331},
format: Literal['auto', 'csv', 'txt', 'tsv', 'xlsx', 'pdresult'] = 'auto',
sep: str = '\t',
decimal: str = '.',
unsafe: bool = False,
verbose: Literal[0, 1, 2] = 1,
) -> Dict[str, Any]

Read an MS Annika result file.

Reads an MS Annika crosslink-spectrum-matches result file or crosslink result file in .csv or .xlsx format, or both from a .pdResult file from Proteome Discoverer, and returns a parser_result.

Parameters:
  • files (str, list of str, or file stream) – The name/path of the MS Annika result file(s) or a file-like object/stream.

  • parse_modifications (bool, default = True) – Whether or not post-translational-modifications should be parsed for crosslink-spectrum-matches. Requires correct specification of the ‘modifications’ parameter.

  • modifications (dict of str, float, default = constants.MODIFICATIONS) – Mapping of modification names to modification masses.

  • format ("auto", "csv", "tsv", "txt", "xlsx", or "pdresult", default = "auto") – The format of the result file. "auto" is only available if the name/path to the MS Annika result file is given.

  • sep (str, default = "\t") – Separator used in the .csv or .tsv file. Parameter is ignored if the file is in .xlsx or .pdResult format.

  • decimal (str, default = ".") – Character to recognize as decimal point. Parameter is ignored if the file is in .xlsx or .pdResult format.

  • unsafe (bool, default = False) – If True, allows reading of negative peptide and crosslink positions but replaces their values with None. Negative values occur when peptides can’t be matched to proteins because of ‘X’ in protein sequences. Reannotation might be possible with transform.reannotate_positions().

  • verbose (0, 1, or 2, default = 1) –

    • 0: All warnings are ignored.

    • 1: Warnings are printed to stdout.

    • 2: Warnings are treated as errors.

Returns:

The parser_result object containing all parsed information.

Return type:

dict

Raises:
  • ValueError – If the input format is not supported or cannot be inferred.

  • TypeError – If the pdResult file is provided in the wrong format.

  • TypeError – If parameter verbose was not set correctly.

  • RuntimeError – If one of the crosslinks or crosslink-spectrum-matches contains unknown crosslink or peptide positions. This occurs when peptides can’t be matched to proteins because of ‘X’ in protein sequences. Selecting ‘unsafe = True’ will ignore these errors and return None type positions. Reannotation might be possible with transform.reannotate_positions().

  • RuntimeError – If the file(s) could not be read or if the file(s) contain no crosslinks or crosslink-spectrum-matches.

  • KeyError – If one of the found post-translational-modifications could not be found/mapped.

Warning

MS Annika does not report if the individual peptides in a crosslink are from the target or decoy database. The parser assumes that both peptides from a target crosslink are from the target database, and vice versa, that both peptides are from the decoy database if it is a decoy crosslink. This leads to only TT and DD matches, which needs to be considered for FDR estimation. This also only applies to crosslinks and not crosslink-spectrum-matches, where this information is correctly reported and parsed.

Examples

>>> from pyXLMS.parser import read_msannika
>>> csms_from_xlsx = read_msannika("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_CSMs.xlsx")
>>> from pyXLMS.parser import read_msannika
>>> crosslinks_from_xlsx = read_msannika("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_Crosslinks.xlsx")
>>> from pyXLMS.parser import read_msannika
>>> csms_from_tsv = read_msannika("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_CSMs.txt")
>>> from pyXLMS.parser import read_msannika
>>> crosslinks_from_tsv = read_msannika("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_Crosslinks.txt")
>>> from pyXLMS.parser import read_msannika
>>> csms_and_crosslinks_from_pdresult = read_msannika("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1.pdResult")
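
The three verbose levels described for this reader (and for several readers below) can be pictured with a small dispatcher. This is only a sketch of the documented behaviour; the helper name `report_warning` and the message text are assumptions, and the library may implement this differently.

```python
def report_warning(message: str, verbose: int = 1) -> None:
    """Hypothetical helper mirroring the documented verbose levels."""
    if verbose not in (0, 1, 2):
        # The readers document a TypeError for an invalid verbose value.
        raise TypeError("Parameter verbose has to be 0, 1, or 2.")
    if verbose == 2:
        # Level 2: warnings are treated as errors.
        raise RuntimeError(message)
    if verbose == 1:
        # Level 1: warnings are printed to stdout.
        print(f"Warning: {message}")
    # Level 0: the warning is ignored.


report_warning("Example warning message.", verbose=1)
```

In practice, verbose = 2 is useful in automated pipelines where a silently printed warning would otherwise go unnoticed.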

pyXLMS.parser.parser_xldbse_mzid module

pyXLMS.parser.parser_xldbse_mzid.parse_scan_nr_from_mzid(spectrum_id: str) -> int

Parse the scan number from a ‘spectrumID’ of an mzIdentML file.

Parameters:

spectrum_id (str) – The ‘spectrumID’ of the mass spectrum from an mzIdentML file read with pyteomics.

Returns:

The scan number.

Return type:

int

Examples

>>> from pyXLMS.parser import parse_scan_nr_from_mzid
>>> parse_scan_nr_from_mzid("scan=5321")
5321
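
When spectrumIDs follow a different convention, a custom function can be supplied via the scan_nr_parser parameter of read_mzid(). The sketch below shows the expected shape of such a function (str in, int out). The helper name and the regular expression are assumptions based on the "scan=5321" example, and it is not known whether the built-in parser works the same way.

```python
import re


def scan_nr_from_spectrum_id_sketch(spectrum_id: str) -> int:
    """Hypothetical parser: extract the digits following 'scan='."""
    match = re.search(r"scan=(\d+)", spectrum_id)
    if match is None:
        raise ValueError(f"No scan number found in {spectrum_id!r}.")
    return int(match.group(1))


print(scan_nr_from_spectrum_id_sketch("scan=5321"))
# 5321
print(scan_nr_from_spectrum_id_sketch("controllerType=0 controllerNumber=1 scan=5321"))
# 5321
```
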
pyXLMS.parser.parser_xldbse_mzid.read_mzid(
files: str | List[str] | BinaryIO,
scan_nr_parser: Callable[[str], int] | None = None,
decoy: bool | None = None,
crosslinkers: Dict[str, float] = {'ADH': 138.09054635, 'BS3': 138.06808, 'DSBSO': 308.03883, 'DSBU': 196.08479231, 'DSS': 138.06808, 'DSSO': 158.00376, 'PhoX': 209.97181},
verbose: Literal[0, 1, 2] = 1,
) -> Dict[str, Any]

Read an mzIdentML (mzid) file.

Reads crosslink-spectrum-matches from an mzIdentML (mzid) file and returns a parser_result.

Parameters:
  • files (str, list of str, or file stream) – The name/path of the mzIdentML (mzid) file(s) or a file-like object/stream.

  • scan_nr_parser (callable, or None, default = None) – A function that parses the scan number from mzid spectrumIDs. If None (default) the function parse_scan_nr_from_mzid() is used.

  • decoy (bool, or None, default = None) – Whether the mzid file contains decoy CSMs (True) or target CSMs (False).

  • crosslinkers (dict of str, float, default = constants.CROSSLINKERS) – Mapping of crosslinker names to crosslinker delta masses.

  • verbose (0, 1, or 2, default = 1) –

    • 0: All warnings are ignored.

    • 1: Warnings are printed to stdout.

    • 2: Warnings are treated as errors.

Returns:

The parser_result object containing all parsed information.

Return type:

dict

Raises:
  • RuntimeError – If the file(s) could not be read or if the file(s) contain no crosslink-spectrum-matches.

  • RuntimeError – If parser is used with verbose = 2.

  • RuntimeError – If there are warnings while reading the mzIdentML file (only for verbose = 2).

  • TypeError – If parameter verbose was not set correctly.

  • TypeError – If one of the values necessary to create a crosslink-spectrum-match could not be parsed correctly.

Notes

This parser is experimental, as I don’t know if the mzIdentML structure is consistent across different crosslink search engines. This parser was tested with mzIdentML files from MS Annika and XlinkX.

Warning

This parser only parses minimal data because most information is not available from the mzIdentML file. The available data is:

  • alpha_peptide

  • alpha_peptide_crosslink_position

  • beta_peptide

  • beta_peptide_crosslink_position

  • spectrum_file

  • scan_nr

Examples

>>> from pyXLMS.parser import read_mzid
>>> csms = read_mzid("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1.mzid")

pyXLMS.parser.parser_xldbse_scout module

pyXLMS.parser.parser_xldbse_scout.detect_scout_filetype(
data: DataFrame,
) -> Literal['scout_csms_unfiltered', 'scout_csms_filtered', 'scout_xl']

Detects the Scout-related source of the data.

Detects whether the input data is unfiltered crosslink-spectrum-matches, filtered crosslink-spectrum-matches, or crosslinks from Scout.

Parameters:

data (pd.DataFrame) – The input data originating from Scout.

Returns:

“scout_csms_unfiltered” if a Scout unfiltered CSMs file was read, “scout_csms_filtered” if a Scout filtered CSMs file was read, “scout_xl” if a Scout crosslink/residue pair result file was read.

Return type:

str

Raises:

ValueError – If the data source could not be determined.

Examples

>>> from pyXLMS.parser import detect_scout_filetype
>>> import pandas as pd
>>> df1 = pd.read_csv("data/scout/Cas9_Unfiltered_CSMs.csv")
>>> detect_scout_filetype(df1)
'scout_csms_unfiltered'
>>> from pyXLMS.parser import detect_scout_filetype
>>> import pandas as pd
>>> df2 = pd.read_csv("data/scout/Cas9_Filtered_CSMs.csv")
>>> detect_scout_filetype(df2)
'scout_csms_filtered'
>>> from pyXLMS.parser import detect_scout_filetype
>>> import pandas as pd
>>> df3 = pd.read_csv("data/scout/Cas9_Residue_Pairs.csv")
>>> detect_scout_filetype(df3)
'scout_xl'
pyXLMS.parser.parser_xldbse_scout.parse_modifications_from_scout_sequence(
seq: str,
crosslink_position: int,
crosslinker: str,
crosslinker_mass: float,
modifications: Dict[str, Tuple[str, float]] = {'+15.994900': ('Oxidation', 15.994915), '+57.021460': ('Carbamidomethyl', 57.021464), 'ADH': ('ADH', 138.09054635), 'BS3': ('BS3', 138.06808), 'Carbamidomethyl': ('Carbamidomethyl', 57.021464), 'DSBSO': ('DSBSO', 308.03883), 'DSBU': ('DSBU', 196.08479231), 'DSS': ('DSS', 138.06808), 'DSSO': ('DSSO', 158.00376), 'Oxidation of Methionine': ('Oxidation', 15.994915), 'PhoX': ('PhoX', 209.97181)},
verbose: Literal[0, 1, 2] = 1,
) -> Dict[int, Tuple[str, float]]

Parse post-translational-modifications from a Scout peptide sequence.

Parses post-translational-modifications (PTMs) from a Scout peptide sequence, for example “M(+15.994900)LASAGELQKGNELALPSK”.

Parameters:
  • seq (str) – The Scout sequence string.

  • crosslink_position (int) – Position of the crosslinker in the sequence (1-based).

  • crosslinker (str) – Name of the used cross-linking reagent, for example “DSSO”.

  • crosslinker_mass (float) – Monoisotopic delta mass of the crosslink modification.

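  • modifications (dict of str, tuple, default = constants.SCOUT_MODIFICATION_MAPPING) – Mapping of Scout sequence elements (e.g. "+15.994900") and modifications (e.g. "Oxidation of Methionine") to their modifications (e.g. ("Oxidation", 15.994915)).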

  • verbose (0, 1, or 2, default = 1) –

    • 0: All warnings are ignored.

    • 1: Warnings are printed to stdout.

    • 2: Warnings are treated as errors.

Returns:

The pyXLMS specific modifications object, a dictionary that maps positions to their corresponding modifications and their monoisotopic masses.

Return type:

dict of int, tuple

Raises:
  • RuntimeError – If multiple modifications on the same residue are parsed (only if verbose = 2).

  • KeyError – If an unknown modification is encountered.

Examples

>>> from pyXLMS.parser import parse_modifications_from_scout_sequence
>>> seq = "M(+15.994900)LASAGELQKGNELALPSK"
>>> parse_modifications_from_scout_sequence(seq, 10, "DSS", 138.06808)
{10: ('DSS', 138.06808), 1: ('Oxidation', 15.994915)}
>>> from pyXLMS.parser import parse_modifications_from_scout_sequence
>>> seq = "KIEC(+57.021460)FDSVEISGVEDR"
>>> parse_modifications_from_scout_sequence(seq, 1, "DSS", 138.06808)
{1: ('DSS', 138.06808), 4: ('Carbamidomethyl', 57.021464)}
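
The mapping from Scout sequence elements to modifications can be sketched with a tokenizing regular expression. This is an assumption-laden illustration rather than the library's code: the helper name is hypothetical, the mapping is a two-entry subset of the default shown in the signature, and a repeated position always raises here, whereas the documented function only does so for verbose = 2.

```python
import re
from typing import Dict, Tuple

# Subset of the default mapping shown in the signature above.
SCOUT_MAPPING_SUBSET = {
    "+15.994900": ("Oxidation", 15.994915),
    "+57.021460": ("Carbamidomethyl", 57.021464),
}

# A token is either a parenthesised annotation or a single amino acid letter.
_TOKEN = re.compile(r"\(([^)]*)\)|([A-Z])")


def parse_scout_sequence_sketch(
    seq: str,
    crosslink_position: int,
    crosslinker: str,
    crosslinker_mass: float,
    mapping: Dict[str, Tuple[str, float]] = SCOUT_MAPPING_SUBSET,
) -> Dict[int, Tuple[str, float]]:
    """Hypothetical sketch: annotations attach to the preceding residue."""
    parsed = {crosslink_position: (crosslinker, crosslinker_mass)}
    residues_seen = 0
    for token in _TOKEN.finditer(seq):
        annotation, residue = token.groups()
        if residue is not None:
            residues_seen += 1
            continue
        if residues_seen in parsed:
            raise RuntimeError(f"Multiple modifications at position {residues_seen}.")
        parsed[residues_seen] = mapping[annotation]  # KeyError if unknown
    return parsed


print(parse_scout_sequence_sketch("M(+15.994900)LASAGELQKGNELALPSK", 10, "DSS", 138.06808))
# {10: ('DSS', 138.06808), 1: ('Oxidation', 15.994915)}
```
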
pyXLMS.parser.parser_xldbse_scout.read_scout(
files: str | List[str] | BinaryIO,
crosslinker: str,
crosslinker_mass: float | None = None,
parse_modifications: bool = True,
modifications: Dict[str, Tuple[str, float]] = {'+15.994900': ('Oxidation', 15.994915), '+57.021460': ('Carbamidomethyl', 57.021464), 'ADH': ('ADH', 138.09054635), 'BS3': ('BS3', 138.06808), 'Carbamidomethyl': ('Carbamidomethyl', 57.021464), 'DSBSO': ('DSBSO', 308.03883), 'DSBU': ('DSBU', 196.08479231), 'DSS': ('DSS', 138.06808), 'DSSO': ('DSSO', 158.00376), 'Oxidation of Methionine': ('Oxidation', 15.994915), 'PhoX': ('PhoX', 209.97181)},
sep: str = ',',
decimal: str = '.',
verbose: Literal[0, 1, 2] = 1,
) -> Dict[str, Any]

Read a Scout result file.

Reads a Scout filtered or unfiltered crosslink-spectrum-matches result file or crosslink/residue pair result file in .csv format and returns a parser_result.

Parameters:
  • files (str, list of str, or file stream) – The name/path of the Scout result file(s) or a file-like object/stream.

  • crosslinker (str) – Name of the used cross-linking reagent, for example “DSSO”.

  • crosslinker_mass (float, or None, default = None) – Monoisotopic delta mass of the crosslink modification. If the crosslinker is defined in parameter “modifications” this can be omitted.

  • parse_modifications (bool, default = True) – Whether or not post-translational-modifications should be parsed for crosslink-spectrum-matches. Requires correct specification of the ‘modifications’ parameter.

  • modifications (dict of str, tuple, default = constants.SCOUT_MODIFICATION_MAPPING) – Mapping of Scout sequence elements (e.g. "+15.994900") and modifications (e.g. "Oxidation of Methionine") to their modifications (e.g. ("Oxidation", 15.994915)).

  • sep (str, default = ",") – Separator used in the .csv file.

  • decimal (str, default = ".") – Character to recognize as decimal point.

  • verbose (0, 1, or 2, default = 1) –

    • 0: All warnings are ignored.

    • 1: Warnings are printed to stdout.

    • 2: Warnings are treated as errors.

Returns:

The parser_result object containing all parsed information.

Return type:

dict

Raises:
  • RuntimeError – If the file(s) could not be read or if the file(s) contain no crosslinks or crosslink-spectrum-matches.

  • KeyError – If the specified crosslinker could not be found/mapped.

  • TypeError – If parameter verbose was not set correctly.

Warning

  • When reading unfiltered crosslink-spectrum-matches, no protein crosslink positions or protein peptide positions are available, as these are not reported. If needed they should be annotated with transform.reannotate_positions().

  • When reading filtered crosslink-spectrum-matches, Scout does not report if the individual peptides in a crosslink are from the target or decoy database. The parser assumes that both peptides from a target crosslink-spectrum-match are from the target database, and vice versa, that both peptides are from the decoy database if it is a decoy crosslink-spectrum-match. This leads to only TT and DD matches, which needs to be considered for FDR estimation.

  • When reading crosslinks / residue pairs, Scout does not report if the individual peptides in a crosslink are from the target or decoy database. The parser assumes that both peptides from a target crosslink are from the target database, and vice versa, that both peptides are from the decoy database if it is a decoy crosslink. This leads to only TT and DD matches, which needs to be considered for FDR estimation.

Examples

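>>> from pyXLMS.parser import read_scout
>>> # "crosslinker" is a required argument; "DSS" is an assumed placeholder, use the reagent of your experiment
>>> csms_unfiltered = read_scout("data/scout/Cas9_Unfiltered_CSMs.csv", crosslinker="DSS")
>>> from pyXLMS.parser import read_scout
>>> csms_filtered = read_scout("data/scout/Cas9_Filtered_CSMs.csv", crosslinker="DSS")
>>> from pyXLMS.parser import read_scout
>>> crosslinks = read_scout("data/scout/Cas9_Residue_Pairs.csv", crosslinker="DSS")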

pyXLMS.parser.parser_xldbse_xi module

pyXLMS.parser.parser_xldbse_xi.detect_xi_filetype(
data: DataFrame,
) -> Literal['xisearch', 'xifdr_csms', 'xifdr_crosslinks']

Detects the xi-related source (application) of the data.

Detects whether the input data is originating from xiSearch or xiFDR, and if xiFDR which type of data is being read (crosslink-spectrum-matches or crosslinks).

Parameters:

data (pd.DataFrame) – The input data originating from xiSearch or xiFDR.

Returns:

“xisearch” if a xiSearch result file was read, “xifdr_csms” if CSMs from xiFDR were read, “xifdr_crosslinks” if crosslinks from xiFDR were read.

Return type:

str

Raises:

ValueError – If the data source could not be determined.

Examples

>>> from pyXLMS.parser import detect_xi_filetype
>>> import pandas as pd
>>> df1 = pd.read_csv("data/xi/r1_Xi1.7.6.7.csv")
>>> detect_xi_filetype(df1)
'xisearch'
>>> from pyXLMS.parser import detect_xi_filetype
>>> import pandas as pd
>>> df2 = pd.read_csv("data/xi/1perc_xl_boost_CSM_xiFDR2.2.1.csv")
>>> detect_xi_filetype(df2)
'xifdr_csms'
>>> from pyXLMS.parser import detect_xi_filetype
>>> import pandas as pd
>>> df3 = pd.read_csv("data/xi/1perc_xl_boost_Links_xiFDR2.2.1.csv")
>>> detect_xi_filetype(df3)
'xifdr_crosslinks'
pyXLMS.parser.parser_xldbse_xi.parse_modifications_from_xi_sequence(sequence: str) -> Dict[int, str]

Parses all post-translational-modifications from a peptide sequence as reported by xiFDR.

Parses all post-translational-modifications from a peptide sequence as reported by xiFDR. This assumes that amino acids are given in upper case letters and post-translational-modifications in lower case letters. The parsed modifications are returned as a dictionary that maps their position in the sequence (1-based) to their xiFDR annotation (SYMBOLEXT), for example "cm" or "ox".

Parameters:

sequence (str) – The peptide sequence as given by xiFDR.

Returns:

Dictionary that maps positions in the peptide sequence (1-based, keys) to their respective modifications (values). The modifications are given in xiFDR annotation style (SYMBOLEXT), which is the lower case modification code, for example "cm" for carbamidomethylation.

Return type:

dict of int, str

Raises:

RuntimeError – If multiple modifications on the same residue are parsed.

Examples

>>> from pyXLMS.parser import parse_modifications_from_xi_sequence
>>> seq1 = "KIECcmFDSVEISGVEDR"
>>> parse_modifications_from_xi_sequence(seq1)
{4: 'cm'}
>>> from pyXLMS.parser import parse_modifications_from_xi_sequence
>>> seq2 = "KIECcmFDSVEMoxISGVEDR"
>>> parse_modifications_from_xi_sequence(seq2)
{4: 'cm', 10: 'ox'}
>>> from pyXLMS.parser import parse_modifications_from_xi_sequence
>>> seq3 = "KIECcmFDSVEISGVEDRMox"
>>> parse_modifications_from_xi_sequence(seq3)
{4: 'cm', 17: 'ox'}
>>> from pyXLMS.parser import parse_modifications_from_xi_sequence
>>> seq4 = "CcmKIECcmFDSVEISGVEDRMox"
>>> parse_modifications_from_xi_sequence(seq4)
{1: 'cm', 5: 'cm', 18: 'ox'}
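
The parsing rule described above (upper case residues, each optionally followed by a lower case annotation) can be illustrated with a short stand-alone sketch. The helper name sketch_parse_xi_modifications is hypothetical and this is not the pyXLMS source; it only handles purely lower case symbols such as "cm" or "ox", and it does not reproduce the RuntimeError check.

```python
import re
from typing import Dict


def sketch_parse_xi_modifications(sequence: str) -> Dict[int, str]:
    """Illustrative sketch only, not the pyXLMS implementation."""
    modifications: Dict[int, str] = {}
    # Each match is one upper case residue plus its (possibly empty)
    # trailing run of lower case letters.
    matches = re.findall(r"([A-Z])([a-z]*)", sequence)
    for position, (_residue, symbol) in enumerate(matches, start=1):
        if symbol:
            # Positions are 1-based and count amino acids only.
            modifications[position] = symbol
    return modifications


print(sketch_parse_xi_modifications("CcmKIECcmFDSVEISGVEDRMox"))
```

Counting only the upper case letters is what keeps the reported positions independent of how many annotation characters precede a residue.
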
pyXLMS.parser.parser_xldbse_xi.parse_peptide(sequence: str, term_char: str = '.') str[source]#

Parses the peptide sequence from a sequence string including flanking amino acids.

Parses the peptide sequence from a sequence string including flanking amino acids, for example "K.KKMoxKLS.S". The returned peptide sequence for this example would be "KKMoxKLS".

Parameters:
  • sequence (str) – The sequence string containing the peptide sequence and flanking amino acids.

  • term_char (str (single character), default = ".") – The character that separates the peptide sequence from its N-terminal and C-terminal flanking amino acids.

Returns:

The parsed peptide sequence without flanking amino acids.

Return type:

str

Raises:

RuntimeError – If (one of) the peptide sequence(s) could not be parsed.

Examples

>>> from pyXLMS.parser import parse_peptide
>>> parse_peptide("K.KKMoxKLS.S")
'KKMoxKLS'
>>> from pyXLMS.parser import parse_peptide
>>> parse_peptide("-.CcmCcmPSR.T")
'CcmCcmPSR'
>>> from pyXLMS.parser import parse_peptide
>>> parse_peptide("CCPSR")
'CCPSR'
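
As a rough illustration of the behaviour shown in the examples above, flanking amino acids can be removed by splitting on the terminal character. The helper sketch_parse_peptide is hypothetical, and the rule that exactly zero or two terminal characters are accepted is an assumption of this sketch, not a documented property of parse_peptide().

```python
def sketch_parse_peptide(sequence: str, term_char: str = ".") -> str:
    """Illustrative sketch only, not the pyXLMS implementation."""
    parts = sequence.split(term_char)
    if len(parts) == 1:
        # No flanking amino acids given, the input is the peptide itself.
        return parts[0]
    if len(parts) == 3:
        # Layout "<N-terminal flank>.<peptide>.<C-terminal flank>".
        return parts[1]
    # Assumption: any other layout is treated as unparsable.
    raise RuntimeError(f"Could not parse peptide from: {sequence}")


print(sketch_parse_peptide("K.KKMoxKLS.S"))
```
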
pyXLMS.parser.parser_xldbse_xi.read_xi(
files: str | List[str] | BinaryIO,
decoy_prefix: str | None = 'auto',
parse_modifications: bool = True,
modifications: Dict[str, Tuple[str, float]] = {'->': ('Substitution', nan), 'bs3_ami': ('BS3 Amidated', 155.094619105), 'bs3_hyd': ('BS3 Hydrolized', 156.0786347), 'bs3_tris': ('BS3 Tris', 259.141973), 'bs3loop': ('BS3 Looplink', 138.06808), 'bs3nh2': ('BS3 Amidated', 155.094619105), 'bs3oh': ('BS3 Hydrolized', 156.0786347), 'cm': ('Carbamidomethyl', 57.021464), 'dsbu_ami': ('DSBU Amidated', 213.111341), 'dsbu_hyd': ('DSBU Hydrolized', 214.095357), 'dsbu_loop': ('DSBU Looplink', 196.08479231), 'dsbu_tris': ('DSBU Tris', 317.158685), 'dsbuloop': ('DSBU Looplink', 196.08479231), 'dsso_ami': ('DSSO Amidated', 175.030313905), 'dsso_hyd': ('DSSO Hydrolized', 176.0143295), 'dsso_loop': ('DSSO Looplink', 158.00376), 'dsso_tris': ('DSSO Tris', 279.077658), 'dssoloop': ('DSSO Looplink', 158.00376), 'ox': ('Oxidation', 15.994915)},
sep: str = ',',
decimal: str = '.',
ignore_errors: bool = False,
verbose: Literal[0, 1, 2] = 1,
) Dict[str, Any][source]#

Read a xiSearch/xiFDR result file.

Reads a xiSearch crosslink-spectrum-matches result file or a xiFDR crosslink-spectrum-matches result file or crosslink result file in .csv format and returns a parser_result.

Parameters:
  • files (str, list of str, or file stream) – The name/path of the xiSearch/xiFDR result file(s) or a file-like object/stream.

  • decoy_prefix (str, or None, default = "auto") – The prefix that indicates that a protein is from the decoy database. If “auto” or None it will use the default for each xi file type.

  • parse_modifications (bool, default = True) – Whether or not post-translational-modifications should be parsed for crosslink-spectrum-matches. Requires correct specification of the ‘modifications’ parameter.

  • modifications (dict of str, tuple, default = constants.XI_MODIFICATION_MAPPING) – Mapping of xi sequence elements (e.g. "cm") to their modifications (e.g. ("Carbamidomethyl", 57.021464)). This corresponds to the SYMBOLEXT field, or the SYMBOL field minus the amino acid in the xiSearch config.

  • sep (str, default = ",") – Separator used in the .csv file.

  • decimal (str, default = ".") – Character to recognize as decimal point.

  • ignore_errors (bool, default = False) – Whether unknown modifications, i.e. modifications that are not given in parameter ‘modifications’, should be tolerated. By default an error is raised if an unknown modification is encountered. If True, unknown modifications are encoded with the xi shortcode (SYMBOLEXT) and a modification mass of float("nan").

  • verbose (0, 1, or 2, default = 1) –

    • 0: All warnings are ignored.

    • 1: Warnings are printed to stdout.

    • 2: Warnings are treated as errors.

Returns:

The parser_result object containing all parsed information.

Return type:

dict

Raises:
  • RuntimeError – If the file(s) contain no crosslinks or crosslink-spectrum-matches.

  • TypeError – If parameter verbose was not set correctly.

Examples

>>> from pyXLMS.parser import read_xi
>>> csms_from_xiSearch = read_xi("data/xi/r1_Xi1.7.6.7.csv")
>>> from pyXLMS.parser import read_xi
>>> csms_from_xiFDR = read_xi("data/xi/1perc_xl_boost_CSM_xiFDR2.2.1.csv")
>>> from pyXLMS.parser import read_xi
>>> crosslinks_from_xiFDR = read_xi("data/xi/1perc_xl_boost_Links_xiFDR2.2.1.csv")
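
The effect of the ignore_errors parameter on unknown modification symbols can be pictured with the following stand-alone sketch. The helper sketch_map_xi_symbol and the symbol "xyz" are hypothetical, the mapping is a two-entry subset of the default shown in the signature, and the exception type raised for unknown symbols is an assumption of this sketch.

```python
import math
from typing import Dict, Tuple

# Two entries copied from the default mapping in the read_xi signature.
XI_MAPPING_SUBSET: Dict[str, Tuple[str, float]] = {
    "cm": ("Carbamidomethyl", 57.021464),
    "ox": ("Oxidation", 15.994915),
}


def sketch_map_xi_symbol(
    symbol: str,
    mapping: Dict[str, Tuple[str, float]],
    ignore_errors: bool = False,
) -> Tuple[str, float]:
    """Illustrative sketch only, not the pyXLMS implementation."""
    if symbol in mapping:
        return mapping[symbol]
    if ignore_errors:
        # Unknown symbols keep their xi shortcode and get a NaN mass.
        return (symbol, float("nan"))
    # Assumption: the exception type used here is a guess.
    raise KeyError(f"Unknown xi modification symbol: {symbol}")


name, mass = sketch_map_xi_symbol("xyz", XI_MAPPING_SUBSET, ignore_errors=True)
print(name, math.isnan(mass))
```

Because NaN never compares equal to itself, downstream code should test such masses with math.isnan() rather than with ==.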

pyXLMS.parser.parser_xldbse_xlinkx module#

pyXLMS.parser.parser_xldbse_xlinkx.read_xlinkx(
files: str | List[str] | BinaryIO,
decoy: bool | None = None,
parse_modifications: bool = True,
modifications: Dict[str, float] = {'ADH': 138.09054635, 'Acetyl': 42.010565, 'BS3': 138.06808, 'Carbamidomethyl': 57.021464, 'DSBSO': 308.03883, 'DSBU': 196.08479231, 'DSS': 138.06808, 'DSSO': 158.00376, 'Oxidation': 15.994915, 'PhoX': 209.97181, 'Phospho': 79.966331},
format: Literal['auto', 'csv', 'txt', 'tsv', 'xlsx', 'pdresult'] = 'auto',
sep: str = '\t',
decimal: str = '.',
ignore_errors: bool = False,
verbose: Literal[0, 1, 2] = 1,
) Dict[str, Any][source]#

Read an XlinkX result file.

Reads an XlinkX crosslink-spectrum-matches result file or crosslink result file in .csv or .xlsx format, or both from a .pdResult file from Proteome Discoverer, and returns a parser_result.

Parameters:
  • files (str, list of str, or file stream) – The name/path of the XlinkX result file(s) or a file-like object/stream.

  • decoy (bool, or None) – Default decoy value to use if no decoy value is found. Only used if the “Is Decoy” column is not found in the supplied data.

  • parse_modifications (bool, default = True) – Whether or not post-translational-modifications should be parsed for crosslink-spectrum-matches. Requires correct specification of the ‘modifications’ parameter.

  • modifications (dict of str, float, default = constants.MODIFICATIONS) – Mapping of modification names to modification masses.

  • format ("auto", "csv", "tsv", "txt", "xlsx", or "pdresult", default = "auto") – The format of the result file. "auto" is only available if the name/path to the XlinkX result file is given.

  • sep (str, default = "\t") – Separator used in the .csv or .tsv file. Parameter is ignored if the file is in .xlsx or .pdResult format.

  • decimal (str, default = ".") – Character to recognize as decimal point. Parameter is ignored if the file is in .xlsx or .pdResult format.

  • ignore_errors (bool, default = False) – Whether missing crosslink positions should be tolerated. Setting this to True suppresses the RuntimeError that is otherwise raised when the crosslink position cannot be parsed for at least one of the crosslinks. For these cases the crosslink position will be set to 100 000.

  • verbose (0, 1, or 2, default = 1) –

    • 0: All warnings are ignored.

    • 1: Warnings are printed to stdout.

    • 2: Warnings are treated as errors.

Returns:

The parser_result object containing all parsed information.

Return type:

dict

Raises:
  • ValueError – If the input format is not supported or cannot be inferred.

  • TypeError – If parameter verbose was not set correctly.

  • TypeError – If the pdResult file is provided in the wrong format.

  • RuntimeError – If the crosslink position could not be parsed for at least one of the crosslinks.

  • RuntimeError – If the file(s) could not be read or if the file(s) contain no crosslinks or crosslink-spectrum-matches.

  • KeyError – If one of the found post-translational-modifications could not be found/mapped.

Warning

XlinkX does not report if the individual peptides in a crosslink are from the target or decoy database. The parser assumes that both peptides from a target crosslink are from the target database, and vice versa, that both peptides are from the decoy database if it is a decoy crosslink. This leads to only TT and DD matches, which needs to be considered for FDR estimation. This applies to both crosslinks and crosslink-spectrum-matches.

Examples

>>> from pyXLMS.parser import read_xlinkx
>>> csms_from_xlsx = read_xlinkx("data/xlinkx/XLpeplib_Beveridge_Lumos_DSSO_MS3_CSMs.xlsx")
>>> from pyXLMS.parser import read_xlinkx
>>> crosslinks_from_xlsx = read_xlinkx("data/xlinkx/XLpeplib_Beveridge_Lumos_DSSO_MS3_Crosslinks.xlsx")
>>> from pyXLMS.parser import read_xlinkx
>>> csms_from_tsv = read_xlinkx("data/xlinkx/XLpeplib_Beveridge_Lumos_DSSO_MS3_CSMs.txt")
>>> from pyXLMS.parser import read_xlinkx
>>> crosslinks_from_tsv = read_xlinkx("data/xlinkx/XLpeplib_Beveridge_Lumos_DSSO_MS3_Crosslinks.txt")
>>> from pyXLMS.parser import read_xlinkx
>>> csms_and_crosslinks_from_pdresult = read_xlinkx("data/xlinkx/XLpeplib_Beveridge_Lumos_DSSO_MS3.pdResult")
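
The consequence of the warning above, that only TT and DD matches can occur, can be made concrete with a small stand-alone sketch. Both helper names are hypothetical and the code only mirrors the assumption stated in the warning; it does not call pyXLMS.

```python
from typing import Tuple


def sketch_peptide_decoy_flags(crosslink_is_decoy: bool) -> Tuple[bool, bool]:
    """Illustrative sketch only: both peptides inherit the crosslink's flag."""
    return crosslink_is_decoy, crosslink_is_decoy


def sketch_match_label(alpha_decoy: bool, beta_decoy: bool) -> str:
    # "T" marks a target peptide, "D" a decoy peptide.
    return ("D" if alpha_decoy else "T") + ("D" if beta_decoy else "T")


labels = {
    sketch_match_label(*sketch_peptide_decoy_flags(flag)) for flag in (False, True)
}
print(sorted(labels))
```

Mixed TD or DT labels never appear under this assumption, which is why FDR formulas that rely on a TD count need to be adapted for XlinkX results.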

pyXLMS.parser.util module#

pyXLMS.parser.util.format_sequence(
sequence: str,
remove_non_aa: bool = True,
remove_lower: bool = True,
) str[source]#

Formats the given amino acid sequence into a common representation.

The given amino acid sequence is re-formatted by converting all amino acids to upper case and optionally removing non-encoding and lower case characters.

Parameters:
  • sequence (str) – The amino acid sequence that should be formatted. Post-translational-modifications can be included in lower case but will be removed.

  • remove_non_aa (bool, default = True) – Whether or not to remove characters that do not encode amino acids.

  • remove_lower (bool, default = True) – Whether or not to remove lower case characters. This should be True if the amino acid sequence encodes post-translational-modifications in lower case.

Returns:

The formatted sequence.

Return type:

str

Examples

>>> from pyXLMS.parser.util import format_sequence
>>> format_sequence("PEP[K]TIDE")
'PEPKTIDE'
>>> from pyXLMS.parser.util import format_sequence
>>> format_sequence("PEPKdssoTIDE")
'PEPKTIDE'
>>> from pyXLMS.parser.util import format_sequence
>>> format_sequence("peptide", remove_lower = False)
'PEPTIDE'
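
The three examples above are consistent with the following stand-alone sketch. The helper sketch_format_sequence is hypothetical, and treating every non-letter character as "not encoding an amino acid" is an assumption of this sketch rather than the documented rule.

```python
import re


def sketch_format_sequence(
    sequence: str, remove_non_aa: bool = True, remove_lower: bool = True
) -> str:
    """Illustrative sketch only, not the pyXLMS implementation."""
    if remove_lower:
        # Lower case letters are assumed to encode modifications.
        sequence = re.sub(r"[a-z]", "", sequence)
    if remove_non_aa:
        # Assumption: anything that is not a letter is not an amino acid.
        sequence = re.sub(r"[^A-Za-z]", "", sequence)
    return sequence.upper()


print(sketch_format_sequence("PEPKdssoTIDE"))
```

Lower case characters are removed before the upper case conversion; reversing the two steps would turn modification codes into spurious residues.
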
pyXLMS.parser.util.get_bool_from_value(value: Any) bool[source]#

Parse a bool value from the given input.

Tries to parse a boolean value from the given input object. If the object is of instance bool it will return the object, if it is of instance int it will return True if the object is 1 or False if the object is 0, any other number will raise a ValueError. If the object is of instance str it will return True if the lower case version contains the letter t and otherwise False. If the object is none of these types a ValueError will be raised.

Parameters:

value (Any) – The value to parse from.

Returns:

The parsed boolean value.

Return type:

bool

Raises:

ValueError – If the object could not be parsed to bool.

Examples

>>> from pyXLMS.parser.util import get_bool_from_value
>>> get_bool_from_value(0)
False
>>> from pyXLMS.parser.util import get_bool_from_value
>>> get_bool_from_value("T")
True
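
The parsing rules listed in the description can be written down as a short stand-alone sketch. The helper sketch_get_bool_from_value is hypothetical and only follows the prose above; it is not the pyXLMS source.

```python
from typing import Any


def sketch_get_bool_from_value(value: Any) -> bool:
    """Illustrative sketch only, following the documented rules."""
    # bool must be checked before int because bool is a subclass of int.
    if isinstance(value, bool):
        return value
    if isinstance(value, int):
        if value in (0, 1):
            return value == 1
        raise ValueError(f"Cannot parse bool from integer {value}.")
    if isinstance(value, str):
        # True if the lower case string contains the letter "t".
        return "t" in value.lower()
    raise ValueError(f"Cannot parse bool from type {type(value).__name__}.")


print(sketch_get_bool_from_value("T"), sketch_get_bool_from_value(0))
```

Note that under the documented string rule any text containing a "t" is parsed as True, so free-text values should be normalised before they reach such a function.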

Module contents#

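pyXLMS.parser.detect_plink_filetype(
file: str | BinaryIO,
sep: str = ',',
decimal: str = '.',
) str[source]#

Detects the pLink-related file type of the data.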

Detects whether the input data is a pLink “*cross-linked_peptides.csv” file or a pLink “*cross-linked_spectra.csv” file.

Parameters:
  • file (str, or BinaryIO) – The name/path of the pLink result file or a file-like object/stream.

  • sep (str, default = ",") – Separator used in the .csv file.

  • decimal (str, default = ".") – Character to recognize as decimal point.

Returns:

Returns “crosslinks” if file is a “*cross-linked_peptides.csv” or “crosslink-spectrum-matches” if file is a “*cross-linked_spectra.csv”.

Return type:

str

Raises:
  • RuntimeError – If the file could not be parsed.

  • RuntimeError – If the file does not contain any data.

  • ValueError – If the file does not match any of the supported pLink input files.

Examples

>>> from pyXLMS.parser import detect_plink_filetype
>>> detect_plink_filetype("data/plink2/Cas9_plus10_2024.06.20.filtered_cross-linked_peptides.csv")
'crosslinks'
>>> from pyXLMS.parser import detect_plink_filetype
>>> detect_plink_filetype("data/plink2/Cas9_plus10_2024.06.20.filtered_cross-linked_spectra.csv")
'crosslink-spectrum-matches'
pyXLMS.parser.detect_scout_filetype(
data: DataFrame,
) Literal['scout_csms_unfiltered', 'scout_csms_filtered', 'scout_xl'][source]#

Detects the Scout-related source of the data.

Detects whether the input data is unfiltered crosslink-spectrum-matches, filtered crosslink-spectrum-matches, or crosslinks from Scout.

Parameters:

data (pd.DataFrame) – The input data originating from Scout.

Returns:

“scout_csms_unfiltered” if a Scout unfiltered CSMs file was read, “scout_csms_filtered” if a Scout filtered CSMs file was read, “scout_xl” if a Scout crosslink/residue pair result file was read.

Return type:

str

Raises:

ValueError – If the data source could not be determined.

Examples

>>> from pyXLMS.parser import detect_scout_filetype
>>> import pandas as pd
>>> df1 = pd.read_csv("data/scout/Cas9_Unfiltered_CSMs.csv")
>>> detect_scout_filetype(df1)
'scout_csms_unfiltered'
>>> from pyXLMS.parser import detect_scout_filetype
>>> import pandas as pd
>>> df2 = pd.read_csv("data/scout/Cas9_Filtered_CSMs.csv")
>>> detect_scout_filetype(df2)
'scout_csms_filtered'
>>> from pyXLMS.parser import detect_scout_filetype
>>> import pandas as pd
>>> df3 = pd.read_csv("data/scout/Cas9_Residue_Pairs.csv")
>>> detect_scout_filetype(df3)
'scout_xl'
pyXLMS.parser.detect_xi_filetype(
data: DataFrame,
) Literal['xisearch', 'xifdr_csms', 'xifdr_crosslinks'][source]#

Detects the xi-related source (application) of the data.

Detects whether the input data originates from xiSearch or xiFDR and, for xiFDR, which type of data is being read (crosslink-spectrum-matches or crosslinks).

Parameters:

data (pd.DataFrame) – The input data originating from xiSearch or xiFDR.

Returns:

“xisearch” if a xiSearch result file was read, “xifdr_csms” if CSMs from xiFDR were read, “xifdr_crosslinks” if crosslinks from xiFDR were read.

Return type:

str

Raises:

ValueError – If the data source could not be determined.

Examples

>>> from pyXLMS.parser import detect_xi_filetype
>>> import pandas as pd
>>> df1 = pd.read_csv("data/xi/r1_Xi1.7.6.7.csv")
>>> detect_xi_filetype(df1)
'xisearch'
>>> from pyXLMS.parser import detect_xi_filetype
>>> import pandas as pd
>>> df2 = pd.read_csv("data/xi/1perc_xl_boost_CSM_xiFDR2.2.1.csv")
>>> detect_xi_filetype(df2)
'xifdr_csms'
>>> from pyXLMS.parser import detect_xi_filetype
>>> import pandas as pd
>>> df3 = pd.read_csv("data/xi/1perc_xl_boost_Links_xiFDR2.2.1.csv")
>>> detect_xi_filetype(df3)
'xifdr_crosslinks'
pyXLMS.parser.parse_modifications_from_maxquant_sequence(
seq: str,
crosslink_position: int,
crosslinker: str,
crosslinker_mass: float,
modifications: Dict[str, float] = {'ADH': 138.09054635, 'Acetyl': 42.010565, 'BS3': 138.06808, 'Carbamidomethyl': 57.021464, 'DSBSO': 308.03883, 'DSBU': 196.08479231, 'DSS': 138.06808, 'DSSO': 158.00376, 'Oxidation': 15.994915, 'PhoX': 209.97181, 'Phospho': 79.966331},
) Dict[int, Tuple[str, float]][source]#

Parse post-translational-modifications from a MaxQuant peptide sequence.

Parses post-translational-modifications (PTMs) from a MaxQuant peptide sequence, for example “_VVDELVKVM(Oxidation (M))GR_”.

Parameters:
  • seq (str) – The MaxQuant sequence string.

  • crosslink_position (int) – Position of the crosslinker in the sequence (1-based).

  • crosslinker (str) – Name of the used cross-linking reagent, for example “DSSO”.

  • crosslinker_mass (float) – Monoisotopic delta mass of the crosslink modification.

  • modifications (dict of str, float, default = constants.MODIFICATIONS) – Mapping of modification names to modification masses.

Returns:

The pyXLMS specific modifications object, a dictionary that maps positions to their corresponding modifications and their monoisotopic masses.

Return type:

dict of int, tuple

Raises:
  • RuntimeError – If the sequence could not be parsed because it is not in MaxQuant format.

  • RuntimeError – If multiple modifications on the same residue are parsed.

  • KeyError – If an unknown modification is encountered.

Examples

>>> from pyXLMS.parser import parse_modifications_from_maxquant_sequence
>>> seq = "_VVDELVKVM(Oxidation (M))GR_"
>>> parse_modifications_from_maxquant_sequence(seq, 2, "DSS", 138.06808)
{2: ('DSS', 138.06808), 9: ('Oxidation', 15.994915)}
>>> from pyXLMS.parser import parse_modifications_from_maxquant_sequence
>>> seq = "_VVDELVKVM(Oxidation (M))GRM(Oxidation (M))_"
>>> parse_modifications_from_maxquant_sequence(seq, 2, "DSS", 138.06808)
{2: ('DSS', 138.06808), 9: ('Oxidation', 15.994915), 12: ('Oxidation', 15.994915)}
>>> from pyXLMS.parser import parse_modifications_from_maxquant_sequence
>>> seq = "_M(Oxidation (M))VVDELVKVM(Oxidation (M))GRM(Oxidation (M))_"
>>> parse_modifications_from_maxquant_sequence(seq, 2, "DSS", 138.06808)
{2: ('DSS', 138.06808), 1: ('Oxidation', 15.994915), 10: ('Oxidation', 15.994915), 13: ('Oxidation', 15.994915)}
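
A stand-alone sketch of how a MaxQuant sequence such as "_VVDELVKVM(Oxidation (M))GR_" could be walked is given below. The helper sketch_parse_maxquant_modifications is hypothetical and not the pyXLMS source; deriving the modification name from the text before " (" is an assumption based on the examples above, and terminal modifications are not handled.

```python
from typing import Dict, Tuple


def sketch_parse_maxquant_modifications(
    seq: str,
    crosslink_position: int,
    crosslinker: str,
    crosslinker_mass: float,
    modifications: Dict[str, float],
) -> Dict[int, Tuple[str, float]]:
    """Illustrative sketch only, not the pyXLMS implementation."""
    if not (seq.startswith("_") and seq.endswith("_")):
        raise RuntimeError("Sequence is not in MaxQuant format.")
    result = {crosslink_position: (crosslinker, crosslinker_mass)}
    position = 0  # 1-based index of the last residue seen
    depth = 0  # current parenthesis nesting level
    current = ""  # text of the annotation being read
    for char in seq.strip("_"):
        if char == "(":
            if depth > 0:
                current += char
            depth += 1
        elif char == ")":
            depth -= 1
            if depth > 0:
                current += char
            else:
                # "Oxidation (M)" -> "Oxidation" (assumed naming rule).
                name = current.split(" (")[0].strip()
                if position in result:
                    raise RuntimeError("Multiple modifications on one residue.")
                # Unknown names raise KeyError via the dictionary lookup.
                result[position] = (name, modifications[name])
                current = ""
        elif depth > 0:
            current += char
        else:
            position += 1
    return result


print(
    sketch_parse_maxquant_modifications(
        "_VVDELVKVM(Oxidation (M))GR_", 2, "DSS", 138.06808, {"Oxidation": 15.994915}
    )
)
```

Tracking the nesting depth is what allows the inner "(M)" to stay part of the annotation instead of ending it early.
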
pyXLMS.parser.parse_modifications_from_scout_sequence(
seq: str,
crosslink_position: int,
crosslinker: str,
crosslinker_mass: float,
modifications: Dict[str, Tuple[str, float]] = {'+15.994900': ('Oxidation', 15.994915), '+57.021460': ('Carbamidomethyl', 57.021464), 'ADH': ('ADH', 138.09054635), 'BS3': ('BS3', 138.06808), 'Carbamidomethyl': ('Carbamidomethyl', 57.021464), 'DSBSO': ('DSBSO', 308.03883), 'DSBU': ('DSBU', 196.08479231), 'DSS': ('DSS', 138.06808), 'DSSO': ('DSSO', 158.00376), 'Oxidation of Methionine': ('Oxidation', 15.994915), 'PhoX': ('PhoX', 209.97181)},
verbose: Literal[0, 1, 2] = 1,
) Dict[int, Tuple[str, float]][source]#

Parse post-translational-modifications from a Scout peptide sequence.

Parses post-translational-modifications (PTMs) from a Scout peptide sequence, for example “M(+15.994900)LASAGELQKGNELALPSK”.

Parameters:
  • seq (str) – The Scout sequence string.

  • crosslink_position (int) – Position of the crosslinker in the sequence (1-based).

  • crosslinker (str) – Name of the used cross-linking reagent, for example “DSSO”.

  • crosslinker_mass (float) – Monoisotopic delta mass of the crosslink modification.

  • modifications (dict of str, tuple, default = constants.SCOUT_MODIFICATION_MAPPING) – Mapping of Scout sequence elements (e.g. "+15.994900") to their modifications (e.g. ("Oxidation", 15.994915)).

  • verbose (0, 1, or 2, default = 1) –

    • 0: All warnings are ignored.

    • 1: Warnings are printed to stdout.

    • 2: Warnings are treated as errors.

Returns:

The pyXLMS specific modifications object, a dictionary that maps positions to their corresponding modifications and their monoisotopic masses.

Return type:

dict of int, tuple

Raises:
  • RuntimeError – If multiple modifications on the same residue are parsed (only if verbose = 2).

  • KeyError – If an unknown modification is encountered.

Examples

>>> from pyXLMS.parser import parse_modifications_from_scout_sequence
>>> seq = "M(+15.994900)LASAGELQKGNELALPSK"
>>> parse_modifications_from_scout_sequence(seq, 10, "DSS", 138.06808)
{10: ('DSS', 138.06808), 1: ('Oxidation', 15.994915)}
>>> from pyXLMS.parser import parse_modifications_from_scout_sequence
>>> seq = "KIEC(+57.021460)FDSVEISGVEDR"
>>> parse_modifications_from_scout_sequence(seq, 1, "DSS", 138.06808)
{1: ('DSS', 138.06808), 4: ('Carbamidomethyl', 57.021464)}
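
The Scout notation, where a residue is followed by its annotation in parentheses, can be sketched with a regular expression. The helper sketch_parse_scout_modifications is hypothetical and not the pyXLMS source; it uses a two-entry subset of the default mapping and does not implement the verbose-dependent handling of a modification on the crosslinked residue.

```python
import re
from typing import Dict, Tuple

# Two entries copied from the default mapping in the signature above.
SCOUT_MAPPING_SUBSET: Dict[str, Tuple[str, float]] = {
    "+15.994900": ("Oxidation", 15.994915),
    "+57.021460": ("Carbamidomethyl", 57.021464),
}


def sketch_parse_scout_modifications(
    seq: str,
    crosslink_position: int,
    crosslinker: str,
    crosslinker_mass: float,
    modifications: Dict[str, Tuple[str, float]],
) -> Dict[int, Tuple[str, float]]:
    """Illustrative sketch only, not the pyXLMS implementation."""
    result = {crosslink_position: (crosslinker, crosslinker_mass)}
    # One residue, optionally followed by "(annotation)".
    pattern = re.compile(r"([A-Z])(?:\(([^()]*)\))?")
    for position, match in enumerate(pattern.finditer(seq), start=1):
        annotation = match.group(2)
        if annotation is not None:
            # Unknown annotations raise KeyError via the lookup.
            # Simplification: an annotation on the crosslinked residue
            # would overwrite the crosslinker entry in this sketch.
            result[position] = modifications[annotation]
    return result


print(
    sketch_parse_scout_modifications(
        "M(+15.994900)LASAGELQKGNELALPSK", 10, "DSS", 138.06808, SCOUT_MAPPING_SUBSET
    )
)
```
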
pyXLMS.parser.parse_modifications_from_xi_sequence(sequence: str) Dict[int, str][source]#

Parses all post-translational-modifications from a peptide sequence as reported by xiFDR.

Parses all post-translational-modifications from a peptide sequence as reported by xiFDR. This assumes that amino acids are given in upper case letters and post-translational-modifications in lower case letters. The parsed modifications are returned as a dictionary that maps their position in the sequence (1-based) to their xiFDR annotation (SYMBOLEXT), for example "cm" or "ox".

Parameters:

sequence (str) – The peptide sequence as given by xiFDR.

Returns:

Dictionary that maps positions in the peptide sequence (1-based, keys) to their respective modifications (values). The modifications are given in xiFDR annotation style (SYMBOLEXT), which is the lower case modification code, for example "cm" for carbamidomethylation.

Return type:

dict of int, str

Raises:

RuntimeError – If multiple modifications on the same residue are parsed.

Examples

>>> from pyXLMS.parser import parse_modifications_from_xi_sequence
>>> seq1 = "KIECcmFDSVEISGVEDR"
>>> parse_modifications_from_xi_sequence(seq1)
{4: 'cm'}
>>> from pyXLMS.parser import parse_modifications_from_xi_sequence
>>> seq2 = "KIECcmFDSVEMoxISGVEDR"
>>> parse_modifications_from_xi_sequence(seq2)
{4: 'cm', 10: 'ox'}
>>> from pyXLMS.parser import parse_modifications_from_xi_sequence
>>> seq3 = "KIECcmFDSVEISGVEDRMox"
>>> parse_modifications_from_xi_sequence(seq3)
{4: 'cm', 17: 'ox'}
>>> from pyXLMS.parser import parse_modifications_from_xi_sequence
>>> seq4 = "CcmKIECcmFDSVEISGVEDRMox"
>>> parse_modifications_from_xi_sequence(seq4)
{1: 'cm', 5: 'cm', 18: 'ox'}
pyXLMS.parser.parse_peptide(sequence: str, term_char: str = '.') str[source]#

Parses the peptide sequence from a sequence string including flanking amino acids.

Parses the peptide sequence from a sequence string including flanking amino acids, for example "K.KKMoxKLS.S". The returned peptide sequence for this example would be "KKMoxKLS".

Parameters:
  • sequence (str) – The sequence string containing the peptide sequence and flanking amino acids.

  • term_char (str (single character), default = ".") – The character that separates the peptide sequence from its N-terminal and C-terminal flanking amino acids.

Returns:

The parsed peptide sequence without flanking amino acids.

Return type:

str

Raises:

RuntimeError – If (one of) the peptide sequence(s) could not be parsed.

Examples

>>> from pyXLMS.parser import parse_peptide
>>> parse_peptide("K.KKMoxKLS.S")
'KKMoxKLS'
>>> from pyXLMS.parser import parse_peptide
>>> parse_peptide("-.CcmCcmPSR.T")
'CcmCcmPSR'
>>> from pyXLMS.parser import parse_peptide
>>> parse_peptide("CCPSR")
'CCPSR'
pyXLMS.parser.parse_scan_nr_from_mzid(spectrum_id: str) int[source]#

Parse the scan number from a ‘spectrumID’ of a mzIdentML file.

Parameters:

spectrum_id (str) – The ‘spectrumID’ of the mass spectrum from an mzIdentML file read with pyteomics.

Returns:

The scan number.

Return type:

int

Examples

>>> from pyXLMS.parser import parse_scan_nr_from_mzid
>>> parse_scan_nr_from_mzid("scan=5321")
5321
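
A minimal stand-alone sketch of extracting the scan number from a ‘spectrumID’ such as "scan=5321" is shown below. The helper sketch_scan_nr_from_spectrum_id is hypothetical, and both the "scan=" pattern and the ValueError for non-matching input are assumptions derived from the single example above.

```python
import re


def sketch_scan_nr_from_spectrum_id(spectrum_id: str) -> int:
    """Illustrative sketch only, not the pyXLMS implementation."""
    match = re.search(r"scan=(\d+)", spectrum_id)
    if match is None:
        # Assumption: the behaviour for unmatched input is not documented.
        raise ValueError(f"No scan number found in: {spectrum_id}")
    return int(match.group(1))


print(sketch_scan_nr_from_spectrum_id("scan=5321"))
```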

pyXLMS.parser.parse_scan_nr_from_plink(title: str) int[source]#

Parse the scan number from a spectrum title.

Parameters:

title (str) – The spectrum title.

Returns:

The scan number.

Return type:

int

Examples

>>> from pyXLMS.parser import parse_scan_nr_from_plink
>>> parse_scan_nr_from_plink("XLpeplib_Beveridge_QEx-HFX_DSS_R1.20588.20588.3.0.dta")
20588

pyXLMS.parser.parse_spectrum_file_from_plink(title: str) str[source]#

Parse the spectrum file name from a spectrum title.

Parameters:

title (str) – The spectrum title.

Returns:

The spectrum file name.

Return type:

str

Examples

>>> from pyXLMS.parser import parse_spectrum_file_from_plink
>>> parse_spectrum_file_from_plink("XLpeplib_Beveridge_QEx-HFX_DSS_R1.20588.20588.3.0.dta")
'XLpeplib_Beveridge_QEx-HFX_DSS_R1'
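
Both pLink title parsers above operate on the same kind of string, which can be taken apart as in this stand-alone sketch. The helper sketch_split_plink_title is hypothetical, and the layout "<file name>.<scan>.<scan>.<charge>.<index>.dta" is an assumption inferred from the single example title, not a documented pLink specification.

```python
from typing import Tuple


def sketch_split_plink_title(title: str) -> Tuple[str, int]:
    """Illustrative sketch only, not the pyXLMS implementation."""
    # Split from the right so that dots inside the file name are preserved.
    parts = title.rsplit(".", 5)
    if len(parts) != 6:
        raise ValueError(f"Unexpected spectrum title layout: {title}")
    spectrum_file, scan_nr = parts[0], int(parts[1])
    return spectrum_file, scan_nr


print(sketch_split_plink_title("XLpeplib_Beveridge_QEx-HFX_DSS_R1.20588.20588.3.0.dta"))
```
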
pyXLMS.parser.pyxlms_modification_str_parser(
modifications: str,
) Dict[int, Tuple[str, float]][source]#

Parse a pyXLMS modification string.

Parses a pyXLMS modification string and returns the pyXLMS specific modification object, a dictionary that maps positions to their modifications.

Parameters:

modifications (str) – The pyXLMS modification string.

Returns:

The pyXLMS specific modification object, a dictionary that maps positions (1-based) to their respective modifications given as tuples of modification name and modification delta mass.

Return type:

dict of int, tuple

Raises:

RuntimeError – If multiple modifications on the same residue are parsed.

Examples

>>> from pyXLMS.parser import pyxlms_modification_str_parser
>>> modification_str = "(1:[DSS|138.06808])"
>>> pyxlms_modification_str_parser(modification_str)
{1: ('DSS', 138.06808)}
>>> from pyXLMS.parser import pyxlms_modification_str_parser
>>> modification_str = "(1:[DSS|138.06808]);(7:[Oxidation|15.994915])"
>>> pyxlms_modification_str_parser(modification_str)
{1: ('DSS', 138.06808), 7: ('Oxidation', 15.994915)}
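
The string layout used in the examples above, "(position:[name|mass])" entries joined by ";", can be parsed with one regular expression. The helper sketch_parse_modification_str is hypothetical and not the pyXLMS source; the exact grammar is inferred from the two example strings only.

```python
import re
from typing import Dict, Tuple

# One "(position:[name|mass])" entry, inferred from the examples above.
ENTRY_PATTERN = re.compile(r"\((\d+):\[([^|\]]+)\|([^\]]+)\]\)")


def sketch_parse_modification_str(modifications: str) -> Dict[int, Tuple[str, float]]:
    """Illustrative sketch only, not the pyXLMS implementation."""
    parsed: Dict[int, Tuple[str, float]] = {}
    for match in ENTRY_PATTERN.finditer(modifications):
        position = int(match.group(1))
        if position in parsed:
            raise RuntimeError("Multiple modifications on the same residue.")
        parsed[position] = (match.group(2), float(match.group(3)))
    return parsed


print(sketch_parse_modification_str("(1:[DSS|138.06808]);(7:[Oxidation|15.994915])"))
```

A function with this signature is also the shape expected for the modification_parser parameter of read_custom(), which takes a callable from str to the modifications dictionary.
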
pyXLMS.parser.read(
files: str | List[str] | BinaryIO,
engine: Literal['Custom', 'MaxQuant', 'MaxLynx', 'MeroX', 'MS Annika', 'mzIdentML', 'pLink', 'Scout', 'xiSearch/xiFDR', 'XlinkX'],
crosslinker: str,
parse_modifications: bool = True,
ignore_errors: bool = False,
verbose: Literal[0, 1, 2] = 1,
**kwargs,
) Dict[str, Any][source]#

Read a crosslink result file.

Reads a crosslink or crosslink-spectrum-match result file from any of the supported crosslink search engines or formats. Currently supports result files from MaxLynx/MaxQuant, MeroX, MS Annika, pLink 2 and pLink 3, Scout, xiSearch and xiFDR, XlinkX, and the mzIdentML format. Additionally supports parsing from custom .csv files in pyXLMS format; see parser.read_custom() and the docs for more about the custom format.

Parameters:
  • files (str, list of str, or file stream) – The name/path of the result file(s) or a file-like object/stream.

  • engine ("Custom", "MaxQuant", "MaxLynx", "MeroX", "MS Annika", "mzIdentML", "pLink", "Scout", "xiSearch/xiFDR", or "XlinkX") – Crosslink search engine or format of the result file.

  • crosslinker (str) – Name of the used cross-linking reagent, for example “DSSO”.

  • parse_modifications (bool, default = True) – Whether or not post-translational-modifications should be parsed for crosslink-spectrum-matches. Requires correct specification of the ‘modifications’ parameter for every parser. Defaults are selected for every parser if ‘modifications’ is not passed via **kwargs.

  • ignore_errors (bool, default = False) – Ignore errors when mapping modifications. Used in parser.read_xi() and parser.read_xlinkx().

  • verbose (0, 1, or 2, default = 1) –

    • 0: All warnings are ignored.

    • 1: Warnings are printed to stdout.

    • 2: Warnings are treated as errors.

  • **kwargs – Any additional parameters will be passed to the specific parsers.

Returns:

The parser_result object containing all parsed information.

Return type:

dict

Raises:

ValueError – If the value entered for parameter engine is not supported.

Examples

>>> from pyXLMS.parser import read
>>> csms_from_xiSearch = read("data/xi/r1_Xi1.7.6.7.csv", engine="xiSearch/xiFDR", crosslinker="DSS")
>>> from pyXLMS.parser import read
>>> csms_from_MaxQuant = read("data/maxquant/run1/crosslinkMsms.txt", engine="MaxQuant", crosslinker="DSS")
pyXLMS.parser.read_custom(
files: str | List[str] | BinaryIO,
column_mapping: Dict[str, str] | None = None,
parse_modifications: bool = True,
modification_parser: Callable[[str], Dict[int, Tuple[str, float]]] | None = None,
decoy_prefix: str = 'REV_',
format: Literal['auto', 'csv', 'txt', 'tsv', 'xlsx'] = 'auto',
sep: str = ',',
decimal: str = '.',
) Dict[str, Any][source]#

Read a custom or pyXLMS result file.

Reads a custom or pyXLMS crosslink-spectrum-matches result file or crosslink result file in .csv or .xlsx format, and returns a parser_result.

The minimum required columns for a crosslink-spectrum-matches result file are:

  • “Alpha Peptide”: The unmodified amino acid sequence of the first peptide.

  • “Alpha Peptide Crosslink Position”: The position of the crosslinker in the sequence of the first peptide (1-based).

  • “Beta Peptide”: The unmodified amino acid sequence of the second peptide.

  • “Beta Peptide Crosslink Position”: The position of the crosslinker in the sequence of the second peptide (1-based).

  • “Spectrum File”: Name of the spectrum file the crosslink-spectrum-match was identified in.

  • “Scan Nr”: The corresponding scan number of the crosslink-spectrum-match.

The minimum required columns for a crosslink result file are:

  • “Alpha Peptide”: The unmodified amino acid sequence of the first peptide.

  • “Alpha Peptide Crosslink Position”: The position of the crosslinker in the sequence of the first peptide (1-based).

  • “Beta Peptide”: The unmodified amino acid sequence of the second peptide.

  • “Beta Peptide Crosslink Position”: The position of the crosslinker in the sequence of the second peptide (1-based).

A full specification of columns that can be parsed can be found in the docs.

Parameters:
  • files (str, list of str, or file stream) – The name/path of the result file(s) or a file-like object/stream.

  • column_mapping (dict of str, str) – A dictionary that maps the result file columns to the required pyXLMS column names.

  • parse_modifications (bool, default = True) – Whether or not post-translational-modifications should be parsed for crosslink-spectrum-matches. Requires correct specification of the ‘modification_parser’ parameter.

  • modification_parser (callable, or None) – A function that parses modification strings and returns the pyXLMS specific modifications object. If None, the function pyxlms_modification_str_parser() is used. If no modification columns are given this parameter is ignored.

  • decoy_prefix (str, default = "REV_") – The prefix that indicates that a protein is from the decoy database.

  • format ("auto", "csv", "tsv", "txt", or "xlsx", default = "auto") – The format of the result file. "auto" is only available if the name/path to the result file is given.

  • sep (str, default = ",") – Separator used in the .csv or .tsv file. Parameter is ignored if the file is in .xlsx format.

  • decimal (str, default = ".") – Character to recognize as decimal point. Parameter is ignored if the file is in .xlsx format.

Returns:

The parser_result object containing all parsed information.

Return type:

dict

Raises:
  • ValueError – If the input format is not supported or cannot be inferred.

  • TypeError – If one of the values could not be parsed.

  • RuntimeError – If the file(s) could not be read or if the file(s) contain no crosslinks or crosslink-spectrum-matches.

Examples

>>> from pyXLMS.parser import read_custom
>>> csms_from_pyxlms = read_custom("data/pyxlms/csm.txt")
>>> from pyXLMS.parser import read_custom
>>> crosslinks_from_pyxlms = read_custom("data/pyxlms/xl.txt")
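
Before handing a custom file to read_custom(), the minimum required columns listed above can be checked with the standard library alone. The helper sketch_missing_columns and the one-row table are made up for illustration; the sketch does not call pyXLMS and does not apply any column_mapping.

```python
import csv
import io
from typing import List

# Column names copied from the lists of minimum required columns above.
REQUIRED_CROSSLINK_COLUMNS = [
    "Alpha Peptide",
    "Alpha Peptide Crosslink Position",
    "Beta Peptide",
    "Beta Peptide Crosslink Position",
]
REQUIRED_CSM_COLUMNS = REQUIRED_CROSSLINK_COLUMNS + ["Spectrum File", "Scan Nr"]


def sketch_missing_columns(csv_text: str, required: List[str], sep: str = ",") -> List[str]:
    """Return the required columns that are absent from the header row."""
    reader = csv.reader(io.StringIO(csv_text), delimiter=sep)
    header = next(reader, [])
    return [column for column in required if column not in header]


# Made-up crosslink table with a single row, for illustration only.
example = (
    "Alpha Peptide,Alpha Peptide Crosslink Position,"
    "Beta Peptide,Beta Peptide Crosslink Position\n"
    "PEPKTIDE,4,KPEPTIDE,1\n"
)
print(sketch_missing_columns(example, REQUIRED_CROSSLINK_COLUMNS))
print(sketch_missing_columns(example, REQUIRED_CSM_COLUMNS))
```

The same table satisfies the crosslink requirements but lacks the two spectrum columns needed for crosslink-spectrum-matches.
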
pyXLMS.parser.read_maxlynx(
files: str | List[str] | BinaryIO,
crosslinker: str,
crosslinker_mass: float | None = None,
decoy_prefix: str = 'REV__',
parse_modifications: bool = True,
modifications: Dict[str, float] = {'ADH': 138.09054635, 'Acetyl': 42.010565, 'BS3': 138.06808, 'Carbamidomethyl': 57.021464, 'DSBSO': 308.03883, 'DSBU': 196.08479231, 'DSS': 138.06808, 'DSSO': 158.00376, 'Oxidation': 15.994915, 'PhoX': 209.97181, 'Phospho': 79.966331},
sep: str = '\t',
decimal: str = '.',
) Dict[str, Any][source]#

Read a MaxLynx result file.

Reads a MaxLynx crosslink-spectrum-matches result file “crosslinkMsms.txt” in .txt (tab delimited) format and returns a parser_result. This is an alias for the MaxQuant reader.

Parameters:
  • files (str, list of str, or file stream) – The name/path of the MaxLynx result file(s) or a file-like object/stream.

  • crosslinker (str) – Name of the used cross-linking reagent, for example “DSSO”.

  • crosslinker_mass (float, or None, default = None) – Monoisotopic delta mass of the crosslink modification. If the crosslinker is defined in parameter “modifications” this can be omitted.

  • decoy_prefix (str, default = "REV__") – The prefix that indicates that a protein is from the decoy database.

  • parse_modifications (bool, default = True) – Whether or not post-translational-modifications should be parsed for crosslink-spectrum-matches. Requires correct specification of the ‘modifications’ parameter.

  • modifications (dict of str, float, default = constants.MODIFICATIONS) – Mapping of modification names to modification masses.

  • sep (str, default = "\t") – Separator used in the .txt file.

  • decimal (str, default = ".") – Character to recognize as decimal point.

Returns:

The parser_result object containing all parsed information.

Return type:

dict

Raises:
  • RuntimeError – If the file(s) could not be read or if the file(s) contain no crosslink-spectrum-matches.

  • KeyError – If the specified crosslinker could not be found/mapped.

Warning

MaxLynx/MaxQuant only reports a single protein crosslink position per peptide; for ambiguous peptides only the crosslink position of the first matching protein is reported. All matching proteins can be retrieved via additional_information, but not their corresponding crosslink positions. For this reason it is recommended to use transform.reannotate_positions() to correctly annotate all crosslink positions for all peptides if that is important for downstream analysis.

Examples

>>> from pyXLMS.parser import read_maxlynx
>>> csms_from_xlsx = read_maxlynx("data/maxquant/run1/crosslinkMsms.txt")
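The crosslinker_mass parameter can be omitted if the crosslinker is defined in the modifications mapping, and a KeyError is documented for crosslinkers that cannot be mapped. A minimal sketch of that lookup order is shown below. The helper resolve_crosslinker_mass() is hypothetical and only illustrates the documented behaviour; it is not the parser's implementation.

```python
from typing import Dict, Optional

# Excerpt of the default `modifications` mapping shown in the signature.
MODIFICATIONS: Dict[str, float] = {
    "DSS": 138.06808,
    "DSSO": 158.00376,
    "Oxidation": 15.994915,
}


def resolve_crosslinker_mass(
    crosslinker: str,
    crosslinker_mass: Optional[float] = None,
    modifications: Dict[str, float] = MODIFICATIONS,
) -> float:
    """Hypothetical helper: prefer an explicit mass, otherwise look the name up."""
    if crosslinker_mass is not None:
        return crosslinker_mass
    if crosslinker not in modifications:
        raise KeyError(f"Cannot find a mass for crosslinker '{crosslinker}'.")
    return modifications[crosslinker]


print(resolve_crosslinker_mass("DSSO"))              # 158.00376
print(resolve_crosslinker_mass("MyLinker", 123.45))  # 123.45
```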
pyXLMS.parser.read_maxquant(
files: str | List[str] | BinaryIO,
crosslinker: str,
crosslinker_mass: float | None = None,
decoy_prefix: str = 'REV__',
parse_modifications: bool = True,
modifications: Dict[str, float] = {'ADH': 138.09054635, 'Acetyl': 42.010565, 'BS3': 138.06808, 'Carbamidomethyl': 57.021464, 'DSBSO': 308.03883, 'DSBU': 196.08479231, 'DSS': 138.06808, 'DSSO': 158.00376, 'Oxidation': 15.994915, 'PhoX': 209.97181, 'Phospho': 79.966331},
sep: str = '\t',
decimal: str = '.',
) Dict[str, Any][source]#

Read a MaxQuant result file.

Reads a MaxQuant crosslink-spectrum-matches result file “crosslinkMsms.txt” in .txt (tab delimited) format and returns a parser_result.

Parameters:
  • files (str, list of str, or file stream) – The name/path of the MaxQuant result file(s) or a file-like object/stream.

  • crosslinker (str) – Name of the used cross-linking reagent, for example “DSSO”.

  • crosslinker_mass (float, or None, default = None) – Monoisotopic delta mass of the crosslink modification. If the crosslinker is defined in parameter “modifications” this can be omitted.

  • decoy_prefix (str, default = "REV__") – The prefix that indicates that a protein is from the decoy database.

  • parse_modifications (bool, default = True) – Whether or not post-translational-modifications should be parsed for crosslink-spectrum-matches. Requires correct specification of the ‘modifications’ parameter.

  • modifications (dict of str, float, default = constants.MODIFICATIONS) – Mapping of modification names to modification masses.

  • sep (str, default = "\t") – Separator used in the .txt file.

  • decimal (str, default = ".") – Character to recognize as decimal point.

Returns:

The parser_result object containing all parsed information.

Return type:

dict

Raises:
  • RuntimeError – If the file(s) could not be read or if the file(s) contain no crosslink-spectrum-matches.

  • KeyError – If the specified crosslinker could not be found/mapped.

Warning

MaxLynx/MaxQuant only reports a single protein crosslink position per peptide; for ambiguous peptides only the crosslink position of the first matching protein is reported. All matching proteins can be retrieved via additional_information, but not their corresponding crosslink positions. For this reason it is recommended to use transform.reannotate_positions() to correctly annotate all crosslink positions for all peptides if that is important for downstream analysis.

Examples

>>> from pyXLMS.parser import read_maxquant
>>> csms = read_maxquant("data/maxquant/run1/crosslinkMsms.txt")
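The warning above concerns peptides that match several proteins. As a rough illustration of what reannotating positions involves, the sketch below searches a peptide in every protein sequence and converts the peptide-level crosslink position into 1-based protein positions. The helper all_protein_crosslink_positions() and the sequences are made up for illustration; this is not the implementation of transform.reannotate_positions().

```python
from typing import Dict, List


def all_protein_crosslink_positions(
    peptide: str,
    peptide_crosslink_position: int,
    proteins: Dict[str, str],
) -> Dict[str, List[int]]:
    """Hypothetical helper: 1-based crosslink positions in every matching protein."""
    positions: Dict[str, List[int]] = {}
    for accession, sequence in proteins.items():
        start = sequence.find(peptide)
        while start != -1:
            # 0-based peptide start + 1-based position within the peptide
            # yields the 1-based position within the protein.
            positions.setdefault(accession, []).append(start + peptide_crosslink_position)
            start = sequence.find(peptide, start + 1)
    return positions


# Made-up sequences for illustration only.
proteins = {"P1": "MAKPEPTIDEKR", "P2": "GGPEPTIDEKAAPEPTIDEK"}
print(all_protein_crosslink_positions("PEPTIDEK", 8, proteins))
# {'P1': [11], 'P2': [10, 20]}
```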
pyXLMS.parser.read_merox(
files: str | List[str] | BinaryIO,
crosslinker: str,
crosslinker_mass: float | None = None,
decoy_prefix: str = 'REV__',
parse_modifications: bool = True,
modifications: Dict[str, Dict[str, Any]] = {'B': {'Amino Acid': 'C', 'Modification': ('Carbamidomethyl', 57.021464)}, 'm': {'Amino Acid': 'M', 'Modification': ('Oxidation', 15.994915)}},
sep: str = ';',
decimal: str = '.',
) Dict[str, Any][source]#

Read a MeroX result file.

Reads a MeroX crosslink-spectrum-matches result file in .csv or .zhrm format and returns a parser_result.

Parameters:
  • files (str, list of str, or file stream) – The name/path of the MeroX result file(s) or a file-like object/stream.

  • crosslinker (str) – Name of the used cross-linking reagent, for example “DSSO”.

  • crosslinker_mass (float, or None, default = None) – Monoisotopic delta mass of the crosslink modification. If the crosslinker is defined in constants.MODIFICATIONS this can be omitted.

  • decoy_prefix (str, default = "REV__") – The prefix that indicates that a protein is from the decoy database.

  • parse_modifications (bool, default = True) – Whether or not post-translational-modifications should be parsed for crosslink-spectrum-matches. Requires correct specification of the ‘modifications’ parameter.

  • modifications (dict of str, dict of str, any, default = constants.MEROX_MODIFICATION_MAPPING) – Mapping of modification symbols to their amino acids and modifications. Please refer to constants.MEROX_MODIFICATION_MAPPING for examples.

  • sep (str, default = ";") – Separator used in the .csv or .zhrm file.

  • decimal (str, default = ".") – Character to recognize as decimal point.

Returns:

The parser_result object containing all parsed information.

Return type:

dict

Raises:
  • RuntimeError – If the file(s) could not be read or if the file(s) contain no crosslink-spectrum-matches.

  • KeyError – If the specified crosslinker could not be found/mapped.

Warning

MeroX only reports a single protein crosslink position per peptide; for ambiguous peptides only the crosslink position of the first matching protein is reported. All matching proteins can be retrieved via additional_information, but not their corresponding crosslink positions. For this reason it is recommended to use transform.reannotate_positions() to correctly annotate all crosslink positions for all peptides if that is important for downstream analysis. Additionally, please note that target and decoy information is derived from the protein accession and parameter decoy_prefix. By default, MeroX only reports target matches that are above the desired FDR.

Examples

>>> from pyXLMS.parser import read_merox
>>> csms_from_csv = read_merox("data/merox/XLpeplib_Beveridge_QEx-HFX_DSS_R1.csv", crosslinker="DSS")
>>> from pyXLMS.parser import read_merox
>>> csms_from_zhrm = read_merox("data/merox/XLpeplib_Beveridge_QEx-HFX_DSS_R1.zhrm", crosslinker="DSS")
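The modifications parameter maps MeroX modification symbols to an amino acid and a modification. The sketch below shows how such a mapping could be applied to a peptide string to obtain the unmodified sequence and a position-to-modification dictionary (1-based). The helper decode_merox_sequence() is hypothetical, the input sequence is made up, and MeroX terminal markers are ignored for simplicity; the parser's own logic may differ.

```python
from typing import Any, Dict, Tuple

# Default mapping as shown in the read_merox signature.
MEROX_MODIFICATION_MAPPING: Dict[str, Dict[str, Any]] = {
    "B": {"Amino Acid": "C", "Modification": ("Carbamidomethyl", 57.021464)},
    "m": {"Amino Acid": "M", "Modification": ("Oxidation", 15.994915)},
}


def decode_merox_sequence(
    sequence: str,
    mapping: Dict[str, Dict[str, Any]] = MEROX_MODIFICATION_MAPPING,
) -> Tuple[str, Dict[int, Tuple[str, float]]]:
    """Hypothetical helper: resolve modification symbols to residues and modifications."""
    residues = []
    modifications: Dict[int, Tuple[str, float]] = {}
    for position, symbol in enumerate(sequence, start=1):
        if symbol in mapping:
            residues.append(mapping[symbol]["Amino Acid"])
            modifications[position] = mapping[symbol]["Modification"]
        else:
            residues.append(symbol)
    return "".join(residues), modifications


print(decode_merox_sequence("KmBR"))
# ('KMCR', {2: ('Oxidation', 15.994915), 3: ('Carbamidomethyl', 57.021464)})
```

Additional symbols can be supported by extending a copy of the mapping with entries of the same shape.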
pyXLMS.parser.read_msannika(
files: str | List[str] | BinaryIO,
parse_modifications: bool = True,
modifications: Dict[str, float] = {'ADH': 138.09054635, 'Acetyl': 42.010565, 'BS3': 138.06808, 'Carbamidomethyl': 57.021464, 'DSBSO': 308.03883, 'DSBU': 196.08479231, 'DSS': 138.06808, 'DSSO': 158.00376, 'Oxidation': 15.994915, 'PhoX': 209.97181, 'Phospho': 79.966331},
format: Literal['auto', 'csv', 'txt', 'tsv', 'xlsx', 'pdresult'] = 'auto',
sep: str = '\t',
decimal: str = '.',
unsafe: bool = False,
verbose: Literal[0, 1, 2] = 1,
) Dict[str, Any][source]#

Read an MS Annika result file.

Reads an MS Annika crosslink-spectrum-matches result file or crosslink result file in .csv or .xlsx format, or both from a .pdResult file from Proteome Discoverer, and returns a parser_result.

Parameters:
  • files (str, list of str, or file stream) – The name/path of the MS Annika result file(s) or a file-like object/stream.

  • parse_modifications (bool, default = True) – Whether or not post-translational-modifications should be parsed for crosslink-spectrum-matches. Requires correct specification of the ‘modifications’ parameter.

  • modifications (dict of str, float, default = constants.MODIFICATIONS) – Mapping of modification names to modification masses.

  • format ("auto", "csv", "tsv", "txt", "xlsx", or "pdresult", default = "auto") – The format of the result file. "auto" is only available if the name/path to the MS Annika result file is given.

  • sep (str, default = "\t") – Separator used in the .csv or .tsv file. Parameter is ignored if the file is in .xlsx or .pdResult format.

  • decimal (str, default = ".") – Character to recognize as decimal point. Parameter is ignored if the file is in .xlsx or .pdResult format.

  • unsafe (bool, default = False) – If True, allows reading of negative peptide and crosslink positions but replaces their values with None. Negative values occur when peptides can’t be matched to proteins because of ‘X’ in protein sequences. Reannotation might be possible with transform.reannotate_positions().

  • verbose (0, 1, or 2, default = 1) –

    • 0: All warnings are ignored.

    • 1: Warnings are printed to stdout.

    • 2: Warnings are treated as errors.

Returns:

The parser_result object containing all parsed information.

Return type:

dict

Raises:
  • ValueError – If the input format is not supported or cannot be inferred.

  • TypeError – If the pdResult file is provided in the wrong format.

  • TypeError – If parameter verbose was not set correctly.

  • RuntimeError – If one of the crosslinks or crosslink-spectrum-matches contains unknown crosslink or peptide positions. This occurs when peptides can’t be matched to proteins because of ‘X’ in protein sequences. Selecting ‘unsafe = True’ will ignore these errors and return None type positions. Reannotation might be possible with transform.reannotate_positions().

  • RuntimeError – If the file(s) could not be read or if the file(s) contain no crosslinks or crosslink-spectrum-matches.

  • KeyError – If one of the found post-translational-modifications could not be found/mapped.

Warning

MS Annika does not report if the individual peptides in a crosslink are from the target or decoy database. The parser assumes that both peptides from a target crosslink are from the target database, and vice versa, that both peptides are from the decoy database if it is a decoy crosslink. This leads to only TT and DD matches, which needs to be considered for FDR estimation. This also only applies to crosslinks and not crosslink-spectrum-matches, where this information is correctly reported and parsed.

Examples

>>> from pyXLMS.parser import read_msannika
>>> csms_from_xlsx = read_msannika("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_CSMs.xlsx")
>>> from pyXLMS.parser import read_msannika
>>> crosslinks_from_xlsx = read_msannika("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_Crosslinks.xlsx")
>>> from pyXLMS.parser import read_msannika
>>> csms_from_tsv = read_msannika("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_CSMs.txt")
>>> from pyXLMS.parser import read_msannika
>>> crosslinks_from_tsv = read_msannika("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_Crosslinks.txt")
>>> from pyXLMS.parser import read_msannika
>>> csms_and_crosslinks_from_pdresult = read_msannika("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1.pdResult")
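The three verbose levels can be pictured with the small sketch below. The helper report_warning() is hypothetical and only mirrors the documented behaviour (0: ignore, 1: print, 2: raise, TypeError for invalid values). Using RuntimeError for level 2 is an assumption.

```python
def report_warning(message: str, verbose: int = 1) -> None:
    """Hypothetical helper mirroring the documented verbose levels."""
    if verbose not in (0, 1, 2):
        raise TypeError("Verbose level has to be one of 0, 1, or 2!")
    if verbose == 1:
        # Level 1: warnings are printed to stdout.
        print(f"Warning: {message}")
    elif verbose == 2:
        # Level 2: warnings are treated as errors (exception type assumed).
        raise RuntimeError(message)
    # Level 0: the warning is silently ignored.


report_warning("Unknown position found.", verbose=0)
report_warning("Unknown position found.", verbose=1)
```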
pyXLMS.parser.read_mzid(
files: str | List[str] | BinaryIO,
scan_nr_parser: Callable[[str], int] | None = None,
decoy: bool | None = None,
crosslinkers: Dict[str, float] = {'ADH': 138.09054635, 'BS3': 138.06808, 'DSBSO': 308.03883, 'DSBU': 196.08479231, 'DSS': 138.06808, 'DSSO': 158.00376, 'PhoX': 209.97181},
verbose: Literal[0, 1, 2] = 1,
) Dict[str, Any][source]#

Read a mzIdentML (mzid) file.

Reads crosslink-spectrum-matches from a mzIdentML (mzid) file and returns a parser_result.

Parameters:
  • files (str, list of str, or file stream) – The name/path of the mzIdentML (mzid) file(s) or a file-like object/stream.

  • scan_nr_parser (callable, or None, default = None) – A function that parses the scan number from mzid spectrumIDs. If None (default) the function parse_scan_nr_from_mzid() is used.

  • decoy (bool, or None, default = None) – Whether the mzid file contains decoy CSMs (True) or target CSMs (False).

  • crosslinkers (dict of str, float, default = constants.CROSSLINKERS) – Mapping of crosslinker names to crosslinker delta masses.

  • verbose (0, 1, or 2, default = 1) –

    • 0: All warnings are ignored.

    • 1: Warnings are printed to stdout.

    • 2: Warnings are treated as errors.

Returns:

The parser_result object containing all parsed information.

Return type:

dict

Raises:
  • RuntimeError – If the file(s) could not be read or if the file(s) contain no crosslink-spectrum-matches.

  • RuntimeError – If parser is used with verbose = 2.

  • RuntimeError – If there are warnings while reading the mzIdentML file (only for verbose = 2).

  • TypeError – If parameter verbose was not set correctly.

  • TypeError – If one of the values necessary to create a crosslink-spectrum-match could not be parsed correctly.

Notes

This parser is experimental, as I don’t know if the mzIdentML structure is consistent across different crosslink search engines. This parser was tested with mzIdentML files from MS Annika and XlinkX.

Warning

This parser only parses minimal data because most information is not available from the mzIdentML file. The available data is:

  • alpha_peptide

  • alpha_peptide_crosslink_position

  • beta_peptide

  • beta_peptide_crosslink_position

  • spectrum_file

  • scan_nr

Examples

>>> from pyXLMS.parser import read_mzid
>>> csms = read_mzid("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1.mzid")
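The scan_nr_parser parameter accepts any callable that turns a spectrumID string into an integer scan number. Below is a minimal sketch of such a callable, assuming spectrumIDs that contain a "scan=<number>" token (as in Thermo-style native IDs). The function name and the example spectrumID are illustrative and not taken from pyXLMS.

```python
import re


def parse_scan_nr_from_native_id(spectrum_id: str) -> int:
    """Hypothetical scan_nr_parser: extract the scan number from 'scan=<number>'."""
    match = re.search(r"scan=(\d+)", spectrum_id)
    if match is None:
        raise ValueError(f"No scan number found in spectrumID '{spectrum_id}'.")
    return int(match.group(1))


print(parse_scan_nr_from_native_id("controllerType=0 controllerNumber=1 scan=20588"))  # 20588

# Passing the parser to read_mzid (requires pyXLMS and the result file):
# csms = read_mzid(
#     "data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1.mzid",
#     scan_nr_parser=parse_scan_nr_from_native_id,
# )
```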

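pyXLMS.parser.read_plink(
files: str | List[str] | BinaryIO,
spectrum_file_parser: Callable[[str], str] | None = None,
scan_nr_parser: Callable[[str], int] | None = None,
decoy_prefix: str = 'REV_',
parse_modifications: bool = True,
modifications: Dict[str, float] = {'ADH': 138.09054635, 'Acetyl': 42.010565, 'BS3': 138.06808, 'Carbamidomethyl': 57.021464, 'DSBSO': 308.03883, 'DSBU': 196.08479231, 'DSS': 138.06808, 'DSSO': 158.00376, 'Oxidation': 15.994915, 'PhoX': 209.97181, 'Phospho': 79.966331},
sep: str = ',',
decimal: str = '.',
verbose: Literal[0, 1, 2] = 1,
) Dict[str, Any][source]#

Read a pLink result file.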

Reads a pLink crosslink-spectrum-matches result file “*cross-linked_spectra.csv” in .csv (comma delimited) format or pLink crosslinks result file “*cross-linked_peptides.csv” in .csv (comma delimited) format and returns a parser_result.

Parameters:
  • files (str, list of str, or file stream) – The name/path of the pLink result file(s) or a file-like object/stream.

  • spectrum_file_parser (callable, or None, default = None) – A function that parses the spectrum file name from spectrum titles. If None (default) the function parse_spectrum_file_from_plink() is used.

  • scan_nr_parser (callable, or None, default = None) – A function that parses the scan number from spectrum titles. If None (default) the function parse_scan_nr_from_plink() is used.

  • decoy_prefix (str, default = "REV_") – The prefix that indicates that a protein is from the decoy database.

  • parse_modifications (bool, default = True) – Whether or not post-translational-modifications should be parsed for crosslink-spectrum-matches. Requires correct specification of the ‘modifications’ parameter.

  • modifications (dict of str, float, default = constants.MODIFICATIONS) – Mapping of modification names to modification masses.

  • sep (str, default = ",") – Separator used in the .csv file.

  • decimal (str, default = ".") – Character to recognize as decimal point.

  • verbose (0, 1, or 2, default = 1) –

    • 0: All warnings are ignored.

    • 1: Warnings are printed to stdout.

    • 2: Warnings are treated as errors.

Returns:

The parser_result object containing all parsed information.

Return type:

dict

Raises:
  • RuntimeError – If the file(s) could not be read or if the file(s) contain no crosslink-spectrum-matches.

  • TypeError – If parameter verbose was not set correctly.

Warning

Target and decoy information is derived from the protein accession and parameter decoy_prefix. By default, pLink only reports target matches that are above the desired FDR.

Examples

>>> from pyXLMS.parser import read_plink
>>> csms = read_plink("data/plink2/Cas9_plus10_2024.06.20.filtered_cross-linked_spectra.csv")
>>> from pyXLMS.parser import read_plink
>>> crosslinks = read_plink("data/plink2/Cas9_plus10_2024.06.20.filtered_cross-linked_peptides.csv")
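Custom callables can be supplied via spectrum_file_parser and scan_nr_parser. The sketch below assumes spectrum titles of the form "<spectrum file>.<scan>.<scan>.<charge>.<index>.dta"; this layout, the function names, and the example title are assumptions made for illustration and should be checked against your own result files.

```python
from typing import List


def _split_title(title: str) -> List[str]:
    # Split from the right so that dots inside the file name are preserved.
    parts = title.strip().rsplit(".", 5)
    if len(parts) != 6:
        raise ValueError(f"Unexpected spectrum title '{title}'.")
    return parts


def parse_spectrum_file(title: str) -> str:
    """Hypothetical spectrum_file_parser for the assumed title layout."""
    return _split_title(title)[0]


def parse_scan_nr(title: str) -> int:
    """Hypothetical scan_nr_parser for the assumed title layout."""
    return int(_split_title(title)[1])


title = "run_2024.06.20.2403.2403.3.0.dta"  # made-up title
print(parse_spectrum_file(title))  # run_2024.06.20
print(parse_scan_nr(title))        # 2403
```

Both functions take a single string and could be passed as spectrum_file_parser=parse_spectrum_file and scan_nr_parser=parse_scan_nr.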
pyXLMS.parser.read_scout(
files: str | List[str] | BinaryIO,
crosslinker: str,
crosslinker_mass: float | None = None,
parse_modifications: bool = True,
modifications: Dict[str, Tuple[str, float]] = {'+15.994900': ('Oxidation', 15.994915), '+57.021460': ('Carbamidomethyl', 57.021464), 'ADH': ('ADH', 138.09054635), 'BS3': ('BS3', 138.06808), 'Carbamidomethyl': ('Carbamidomethyl', 57.021464), 'DSBSO': ('DSBSO', 308.03883), 'DSBU': ('DSBU', 196.08479231), 'DSS': ('DSS', 138.06808), 'DSSO': ('DSSO', 158.00376), 'Oxidation of Methionine': ('Oxidation', 15.994915), 'PhoX': ('PhoX', 209.97181)},
sep: str = ',',
decimal: str = '.',
verbose: Literal[0, 1, 2] = 1,
) Dict[str, Any][source]#

Read a Scout result file.

Reads a Scout filtered or unfiltered crosslink-spectrum-matches result file or crosslink/residue pair result file in .csv format and returns a parser_result.

Parameters:
  • files (str, list of str, or file stream) – The name/path of the Scout result file(s) or a file-like object/stream.

  • crosslinker (str) – Name of the used cross-linking reagent, for example “DSSO”.

  • crosslinker_mass (float, or None, default = None) – Monoisotopic delta mass of the crosslink modification. If the crosslinker is defined in parameter “modifications” this can be omitted.

  • parse_modifications (bool, default = True) – Whether or not post-translational-modifications should be parsed for crosslink-spectrum-matches. Requires correct specification of the ‘modifications’ parameter.

  • modifications (dict of str, tuple, default = constants.SCOUT_MODIFICATION_MAPPING) – Mapping of Scout sequence elements (e.g. "+15.994900") and modifications (e.g. "Oxidation of Methionine") to their modifications (e.g. ("Oxidation", 15.994915)).

  • sep (str, default = ",") – Separator used in the .csv file.

  • decimal (str, default = ".") – Character to recognize as decimal point.

  • verbose (0, 1, or 2, default = 1) –

    • 0: All warnings are ignored.

    • 1: Warnings are printed to stdout.

    • 2: Warnings are treated as errors.

Returns:

The parser_result object containing all parsed information.

Return type:

dict

Raises:
  • RuntimeError – If the file(s) could not be read or if the file(s) contain no crosslinks or crosslink-spectrum-matches.

  • KeyError – If the specified crosslinker could not be found/mapped.

  • TypeError – If parameter verbose was not set correctly.

Warning

  • When reading unfiltered crosslink-spectrum-matches, no protein crosslink positions or protein peptide positions are available, as these are not reported. If needed they should be annotated with transform.reannotate_positions().

  • When reading filtered crosslink-spectrum-matches, Scout does not report if the individual peptides in a crosslink are from the target or decoy database. The parser assumes that both peptides from a target crosslink-spectrum-match are from the target database, and vice versa, that both peptides are from the decoy database if it is a decoy crosslink-spectrum-match. This leads to only TT and DD matches, which needs to be considered for FDR estimation.

  • When reading crosslinks / residue pairs, Scout does not report if the individual peptides in a crosslink are from the target or decoy database. The parser assumes that both peptides from a target crosslink are from the target database, and vice versa, that both peptides are from the decoy database if it is a decoy crosslink. This leads to only TT and DD matches, which needs to be considered for FDR estimation.

Examples

>>> from pyXLMS.parser import read_scout
>>> csms_unfiltered = read_scout("data/scout/Cas9_Unfiltered_CSMs.csv")
>>> from pyXLMS.parser import read_scout
>>> csms_filtered = read_scout("data/scout/Cas9_Filtered_CSMs.csv")
>>> from pyXLMS.parser import read_scout
>>> crosslinks = read_scout("data/scout/Cas9_Residue_Pairs.csv")
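If a result file contains modifications that are missing from the default mapping, the modifications parameter can be given an extended copy. The sketch below uses an excerpt of the default mapping from the signature. The two added keys are assumed Scout spellings for phosphorylation and should be checked against the actual file. The crosslinker value in the commented call is a placeholder.

```python
from typing import Dict, Tuple

# Excerpt of the default mapping shown in the read_scout signature.
scout_mapping: Dict[str, Tuple[str, float]] = {
    "+15.994900": ("Oxidation", 15.994915),
    "+57.021460": ("Carbamidomethyl", 57.021464),
    "Oxidation of Methionine": ("Oxidation", 15.994915),
    "DSSO": ("DSSO", 158.00376),
}

# Copy before extending so that the defaults stay untouched.
custom_mapping = dict(scout_mapping)
# Assumed spellings; both keys map to the same (name, delta mass) tuple.
custom_mapping["+79.966331"] = ("Phospho", 79.966331)
custom_mapping["Phosphorylation"] = ("Phospho", 79.966331)

print(len(scout_mapping), len(custom_mapping))  # 4 6

# Usage (requires pyXLMS and the result file; crosslinker is a placeholder):
# csms = read_scout(
#     "data/scout/Cas9_Unfiltered_CSMs.csv",
#     crosslinker="DSSO",
#     modifications=custom_mapping,
# )
```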
pyXLMS.parser.read_xi(
files: str | List[str] | BinaryIO,
decoy_prefix: str | None = 'auto',
parse_modifications: bool = True,
modifications: Dict[str, Tuple[str, float]] = {'->': ('Substitution', nan), 'bs3_ami': ('BS3 Amidated', 155.094619105), 'bs3_hyd': ('BS3 Hydrolized', 156.0786347), 'bs3_tris': ('BS3 Tris', 259.141973), 'bs3loop': ('BS3 Looplink', 138.06808), 'bs3nh2': ('BS3 Amidated', 155.094619105), 'bs3oh': ('BS3 Hydrolized', 156.0786347), 'cm': ('Carbamidomethyl', 57.021464), 'dsbu_ami': ('DSBU Amidated', 213.111341), 'dsbu_hyd': ('DSBU Hydrolized', 214.095357), 'dsbu_loop': ('DSBU Looplink', 196.08479231), 'dsbu_tris': ('DSBU Tris', 317.158685), 'dsbuloop': ('DSBU Looplink', 196.08479231), 'dsso_ami': ('DSSO Amidated', 175.030313905), 'dsso_hyd': ('DSSO Hydrolized', 176.0143295), 'dsso_loop': ('DSSO Looplink', 158.00376), 'dsso_tris': ('DSSO Tris', 279.077658), 'dssoloop': ('DSSO Looplink', 158.00376), 'ox': ('Oxidation', 15.994915)},
sep: str = ',',
decimal: str = '.',
ignore_errors: bool = False,
verbose: Literal[0, 1, 2] = 1,
) Dict[str, Any][source]#

Read a xiSearch/xiFDR result file.

Reads a xiSearch crosslink-spectrum-matches result file or a xiFDR crosslink-spectrum-matches result file or crosslink result file in .csv format and returns a parser_result.

Parameters:
  • files (str, list of str, or file stream) – The name/path of the xiSearch/xiFDR result file(s) or a file-like object/stream.

  • decoy_prefix (str, or None, default = "auto") – The prefix that indicates that a protein is from the decoy database. If “auto” or None it will use the default for each xi file type.

  • parse_modifications (bool, default = True) – Whether or not post-translational-modifications should be parsed for crosslink-spectrum-matches. Requires correct specification of the ‘modifications’ parameter.

  • modifications (dict of str, tuple, default = constants.XI_MODIFICATION_MAPPING) – Mapping of xi sequence elements (e.g. "cm") to their modifications (e.g. ("Carbamidomethyl", 57.021464)). This corresponds to the SYMBOLEXT field, or the SYMBOL field minus the amino acid in the xiSearch config.

  • sep (str, default = ",") – Separator used in the .csv file.

  • decimal (str, default = ".") – Character to recognize as decimal point.

  • ignore_errors (bool, default = False) – Whether modifications that are not given in parameter ‘modifications’ should be ignored instead of raising an error. By default an error is raised if an unknown modification is encountered. If True, unknown modifications are encoded with the xi shortcode (SYMBOLEXT) and a modification mass of float("nan").

  • verbose (0, 1, or 2, default = 1) –

    • 0: All warnings are ignored.

    • 1: Warnings are printed to stdout.

    • 2: Warnings are treated as errors.

Returns:

The parser_result object containing all parsed information.

Return type:

dict

Raises:
  • RuntimeError – If the file(s) contain no crosslinks or crosslink-spectrum-matches.

  • TypeError – If parameter verbose was not set correctly.

Examples

>>> from pyXLMS.parser import read_xi
>>> csms_from_xiSearch = read_xi("data/xi/r1_Xi1.7.6.7.csv")
>>> from pyXLMS.parser import read_xi
>>> csms_from_xiFDR = read_xi("data/xi/1perc_xl_boost_CSM_xiFDR2.2.1.csv")
>>> from pyXLMS.parser import read_xi
>>> crosslinks_from_xiFDR = read_xi("data/xi/1perc_xl_boost_Links_xiFDR2.2.1.csv")
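The interplay of the modifications mapping and ignore_errors can be sketched as follows. The helper map_xi_modification() is hypothetical, and raising KeyError for unknown shortcodes is an assumption, since the documentation does not name the exception type.

```python
import math
from typing import Dict, Tuple

# Excerpt of the default mapping shown in the read_xi signature.
XI_MAPPING: Dict[str, Tuple[str, float]] = {
    "cm": ("Carbamidomethyl", 57.021464),
    "ox": ("Oxidation", 15.994915),
}


def map_xi_modification(
    symbol_ext: str,
    mapping: Dict[str, Tuple[str, float]] = XI_MAPPING,
    ignore_errors: bool = False,
) -> Tuple[str, float]:
    """Hypothetical helper: look up a xi shortcode with an optional NaN fallback."""
    if symbol_ext in mapping:
        return mapping[symbol_ext]
    if ignore_errors:
        # Unknown modification: keep the shortcode, mass is not a number.
        return (symbol_ext, float("nan"))
    raise KeyError(f"Unknown modification '{symbol_ext}'.")  # exception type assumed


print(map_xi_modification("cm"))  # ('Carbamidomethyl', 57.021464)
name, mass = map_xi_modification("xyz", ignore_errors=True)
print(name, math.isnan(mass))     # xyz True
```

Because NaN never compares equal to itself, downstream code should test such masses with math.isnan() rather than with ==.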
pyXLMS.parser.read_xlinkx(
files: str | List[str] | BinaryIO,
decoy: bool | None = None,
parse_modifications: bool = True,
modifications: Dict[str, float] = {'ADH': 138.09054635, 'Acetyl': 42.010565, 'BS3': 138.06808, 'Carbamidomethyl': 57.021464, 'DSBSO': 308.03883, 'DSBU': 196.08479231, 'DSS': 138.06808, 'DSSO': 158.00376, 'Oxidation': 15.994915, 'PhoX': 209.97181, 'Phospho': 79.966331},
format: Literal['auto', 'csv', 'txt', 'tsv', 'xlsx', 'pdresult'] = 'auto',
sep: str = '\t',
decimal: str = '.',
ignore_errors: bool = False,
verbose: Literal[0, 1, 2] = 1,
) Dict[str, Any][source]#

Read an XlinkX result file.

Reads an XlinkX crosslink-spectrum-matches result file or crosslink result file in .csv or .xlsx format, or both from a .pdResult file from Proteome Discoverer, and returns a parser_result.

Parameters:
  • files (str, list of str, or file stream) – The name/path of the XlinkX result file(s) or a file-like object/stream.

  • decoy (bool, or None, default = None) – Default decoy value to use if no decoy value is found. Only used if the “Is Decoy” column is not found in the supplied data.

  • parse_modifications (bool, default = True) – Whether or not post-translational-modifications should be parsed for crosslink-spectrum-matches. Requires correct specification of the ‘modifications’ parameter.

  • modifications (dict of str, float, default = constants.MODIFICATIONS) – Mapping of modification names to modification masses.

  • format ("auto", "csv", "tsv", "txt", "xlsx", or "pdresult", default = "auto") – The format of the result file. "auto" is only available if the name/path to the XlinkX result file is given.

  • sep (str, default = "\t") – Separator used in the .csv or .tsv file. Parameter is ignored if the file is in .xlsx or .pdResult format.

  • decimal (str, default = ".") – Character to recognize as decimal point. Parameter is ignored if the file is in .xlsx or .pdResult format.

  • ignore_errors (bool, default = False) – Whether missing crosslink positions should be ignored instead of raising an error. Setting this to True suppresses the RuntimeError that is raised when the crosslink position cannot be parsed for at least one of the crosslinks. In these cases the crosslink position is set to 100 000.

  • verbose (0, 1, or 2, default = 1) –

    • 0: All warnings are ignored.

    • 1: Warnings are printed to stdout.

    • 2: Warnings are treated as errors.

Returns:

The parser_result object containing all parsed information.

Return type:

dict

Raises:
  • ValueError – If the input format is not supported or cannot be inferred.

  • TypeError – If parameter verbose was not set correctly.

  • TypeError – If the pdResult file is provided in the wrong format.

  • RuntimeError – If the crosslink position could not be parsed for at least one of the crosslinks.

  • RuntimeError – If the file(s) could not be read or if the file(s) contain no crosslinks or crosslink-spectrum-matches.

  • KeyError – If one of the found post-translational-modifications could not be found/mapped.

Warning

XlinkX does not report if the individual peptides in a crosslink are from the target or decoy database. The parser assumes that both peptides from a target crosslink are from the target database, and vice versa, that both peptides are from the decoy database if it is a decoy crosslink. This leads to only TT and DD matches, which needs to be considered for FDR estimation. This applies to both crosslinks and crosslink-spectrum-matches.

Examples

>>> from pyXLMS.parser import read_xlinkx
>>> csms_from_xlsx = read_xlinkx("data/xlinkx/XLpeplib_Beveridge_Lumos_DSSO_MS3_CSMs.xlsx")
>>> from pyXLMS.parser import read_xlinkx
>>> crosslinks_from_xlsx = read_xlinkx("data/xlinkx/XLpeplib_Beveridge_Lumos_DSSO_MS3_Crosslinks.xlsx")
>>> from pyXLMS.parser import read_xlinkx
>>> csms_from_tsv = read_xlinkx("data/xlinkx/XLpeplib_Beveridge_Lumos_DSSO_MS3_CSMs.txt")
>>> from pyXLMS.parser import read_xlinkx
>>> crosslinks_from_tsv = read_xlinkx("data/xlinkx/XLpeplib_Beveridge_Lumos_DSSO_MS3_Crosslinks.txt")
>>> from pyXLMS.parser import read_xlinkx
>>> csms_and_crosslinks_from_pdresult = read_xlinkx("data/xlinkx/XLpeplib_Beveridge_Lumos_DSSO_MS3.pdResult")
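The consequence of the warning above can be made concrete with a small counting sketch. When only a single decoy flag per match is available and it is applied to both peptides, target-decoy (TD) matches cannot occur. The helper count_match_types() and the flag values are made up for illustration and do not use pyXLMS data structures.

```python
from typing import Dict, Iterable, Tuple


def count_match_types(decoy_flags: Iterable[Tuple[bool, bool]]) -> Dict[str, int]:
    """Count TT, TD, and DD matches from (alpha_decoy, beta_decoy) pairs."""
    counts = {"TT": 0, "TD": 0, "DD": 0}
    for alpha_decoy, beta_decoy in decoy_flags:
        if alpha_decoy and beta_decoy:
            counts["DD"] += 1
        elif alpha_decoy or beta_decoy:
            counts["TD"] += 1
        else:
            counts["TT"] += 1
    return counts


# One decoy flag per match, applied to both peptides (made-up values).
is_decoy_column = [False, False, True, False]
pairs = [(flag, flag) for flag in is_decoy_column]
print(count_match_types(pairs))  # {'TT': 3, 'TD': 0, 'DD': 1}
```

FDR formulas that rely on the TD count therefore need to be adapted for such data.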