pyXLMS package
Submodules
pyXLMS.constants module
- pyXLMS.constants.AMINO_ACIDS = {'A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'Y'}#
Set of valid amino acids.
Set of one-letter codes for all valid amino acids.
Examples
>>> from pyXLMS.constants import AMINO_ACIDS
>>> "A" in AMINO_ACIDS
True
>>> "B" in AMINO_ACIDS
False
- pyXLMS.constants.AMINO_ACIDS_1TO3 = {'A': 'ALA', 'C': 'CYS', 'D': 'ASP', 'E': 'GLU', 'F': 'PHE', 'G': 'GLY', 'H': 'HIS', 'I': 'ILE', 'K': 'LYS', 'L': 'LEU', 'M': 'MET', 'N': 'ASN', 'P': 'PRO', 'Q': 'GLN', 'R': 'ARG', 'S': 'SER', 'T': 'THR', 'V': 'VAL', 'W': 'TRP', 'Y': 'TYR'}#
Mapping of amino acid 1-letter codes to their 3-letter codes.
Mapping of all amino acid 1-letter codes to their corresponding 3-letter codes.
Examples
>>> from pyXLMS.constants import AMINO_ACIDS_1TO3
>>> AMINO_ACIDS_1TO3["G"]
'GLY'
- pyXLMS.constants.AMINO_ACIDS_3TO1 = {'ALA': 'A', 'ARG': 'R', 'ASN': 'N', 'ASP': 'D', 'CYS': 'C', 'GLN': 'Q', 'GLU': 'E', 'GLY': 'G', 'HIS': 'H', 'ILE': 'I', 'LEU': 'L', 'LYS': 'K', 'MET': 'M', 'PHE': 'F', 'PRO': 'P', 'SER': 'S', 'THR': 'T', 'TRP': 'W', 'TYR': 'Y', 'VAL': 'V'}#
Mapping of amino acid 3-letter codes to their 1-letter codes.
Mapping of all amino acid 3-letter codes to their corresponding 1-letter codes.
Examples
>>> from pyXLMS.constants import AMINO_ACIDS_3TO1
>>> AMINO_ACIDS_3TO1["GLY"]
'G'
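The two mappings above are inverses of each other, which makes translating a whole peptide sequence straightforward. The following is a minimal sketch that copies a few entries from the listing above instead of importing pyXLMS, so that it runs on its own. The helper name `to_three_letter` is hypothetical and not part of pyXLMS.

```python
from typing import List

# Subset copied from the AMINO_ACIDS_1TO3 listing above (not imported from pyXLMS).
AMINO_ACIDS_1TO3 = {"G": "GLY", "A": "ALA", "K": "LYS", "P": "PRO"}

# Inverting the dictionary yields the matching subset of AMINO_ACIDS_3TO1.
AMINO_ACIDS_3TO1 = {three: one for one, three in AMINO_ACIDS_1TO3.items()}


def to_three_letter(sequence: str) -> List[str]:
    """Hypothetical helper: translate a 1-letter sequence into 3-letter codes."""
    return [AMINO_ACIDS_1TO3[residue] for residue in sequence]


codes = to_three_letter("GAK")
# Translating back with the inverted mapping recovers the original sequence.
round_trip = "".join(AMINO_ACIDS_3TO1[code] for code in codes)
```

With the real constants the same pattern should apply to any sequence made up of the 20 residues listed in AMINO_ACIDS_1TO3.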
- pyXLMS.constants.CROSSLINKERS = {'ADH': 138.09054635, 'BS3': 138.06808, 'DSBSO': 308.03883, 'DSBU': 196.08479231, 'DSS': 138.06808, 'DSSO': 158.00376, 'PhoX': 209.97181}#
Dictionary of crosslinkers.
Dictionary of pre-defined crosslinkers that maps crosslinker names to crosslinker delta masses. Currently contains “BS3”, “DSS”, “DSSO”, “DSBU”, “ADH”, “DSBSO”, “PhoX”.
Examples
>>> from pyXLMS.constants import CROSSLINKERS
>>> CROSSLINKERS["BS3"]
138.06808
- pyXLMS.constants.MODIFICATIONS = {'ADH': 138.09054635, 'Acetyl': 42.010565, 'BS3': 138.06808, 'Carbamidomethyl': 57.021464, 'DSBSO': 308.03883, 'DSBU': 196.08479231, 'DSS': 138.06808, 'DSSO': 158.00376, 'Oxidation': 15.994915, 'PhoX': 209.97181, 'Phospho': 79.966331}#
Dictionary of post-translational-modifications.
Dictionary of pre-defined post-translational-modifications that maps modification names to modification delta masses. Currently contains “Carbamidomethyl”, “Oxidation”, “Phospho”, “Acetyl” and all crosslinkers.
Examples
>>> from pyXLMS.constants import MODIFICATIONS
>>> MODIFICATIONS["Carbamidomethyl"]
57.021464
>>> MODIFICATIONS["BS3"]
138.06808
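The description above implies that MODIFICATIONS is a superset of CROSSLINKERS. The sketch below illustrates that relationship with values copied from the listings above rather than imported from pyXLMS. How the library actually builds the dictionary is not shown in this documentation, so the merge step and the name `PTMS` are assumptions made for illustration only.

```python
# Values copied from the CROSSLINKERS listing above.
CROSSLINKERS = {
    "ADH": 138.09054635,
    "BS3": 138.06808,
    "DSBSO": 308.03883,
    "DSBU": 196.08479231,
    "DSS": 138.06808,
    "DSSO": 158.00376,
    "PhoX": 209.97181,
}

# Hypothetical name for the non-crosslinker entries of MODIFICATIONS.
PTMS = {
    "Acetyl": 42.010565,
    "Carbamidomethyl": 57.021464,
    "Oxidation": 15.994915,
    "Phospho": 79.966331,
}

# Assumption: merging both dictionaries reproduces the MODIFICATIONS listing.
MODIFICATIONS = {**PTMS, **CROSSLINKERS}

# Every crosslinker is available under the same name with the same delta mass.
crosslinkers_included = all(
    MODIFICATIONS[name] == mass for name, mass in CROSSLINKERS.items()
)
```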
- pyXLMS.constants.SCOUT_MODIFICATION_MAPPING = {'+15.994900': ('Oxidation', 15.994915), '+57.021460': ('Carbamidomethyl', 57.021464), 'ADH': ('ADH', 138.09054635), 'BS3': ('BS3', 138.06808), 'Carbamidomethyl': ('Carbamidomethyl', 57.021464), 'DSBSO': ('DSBSO', 308.03883), 'DSBU': ('DSBU', 196.08479231), 'DSS': ('DSS', 138.06808), 'DSSO': ('DSSO', 158.00376), 'Oxidation of Methionine': ('Oxidation', 15.994915), 'PhoX': ('PhoX', 209.97181)}#
Dictionary that maps sequence elements and modifications from Scout to their corresponding post-translational-modifications.
Dictionary that maps sequence elements (e.g. “+57.021460”) and modifications (e.g. “Carbamidomethyl”) from Scout to their corresponding post-translational-modifications (e.g. (“Carbamidomethyl”, 57.021464)).
Examples
>>> from pyXLMS.constants import SCOUT_MODIFICATION_MAPPING
>>> SCOUT_MODIFICATION_MAPPING["+57.021460"]
('Carbamidomethyl', 57.021464)
>>> SCOUT_MODIFICATION_MAPPING["Carbamidomethyl"]
('Carbamidomethyl', 57.021464)
>>> SCOUT_MODIFICATION_MAPPING["Oxidation of Methionine"]
('Oxidation', 15.994915)
- pyXLMS.constants.XI_MODIFICATION_MAPPING = {'->': ('Substitution', nan), 'bs3_ami': ('BS3 Amidated', 155.094619105), 'bs3_hyd': ('BS3 Hydrolized', 156.0786347), 'bs3_tris': ('BS3 Tris', 259.141973), 'bs3loop': ('BS3 Looplink', 138.06808), 'bs3nh2': ('BS3 Amidated', 155.094619105), 'bs3oh': ('BS3 Hydrolized', 156.0786347), 'cm': ('Carbamidomethyl', 57.021464), 'dsbu_ami': ('DSBU Amidated', 213.111341), 'dsbu_hyd': ('DSBU Hydrolized', 214.095357), 'dsbu_loop': ('DSBU Looplink', 196.08479231), 'dsbu_tris': ('DSBU Tris', 317.158685), 'dsbuloop': ('DSBU Looplink', 196.08479231), 'dsso_ami': ('DSSO Amidated', 175.030313905), 'dsso_hyd': ('DSSO Hydrolized', 176.0143295), 'dsso_loop': ('DSSO Looplink', 158.00376), 'dsso_tris': ('DSSO Tris', 279.077658), 'dssoloop': ('DSSO Looplink', 158.00376), 'ox': ('Oxidation', 15.994915)}#
Dictionary that maps sequence elements from xiSearch and xiFDR to their corresponding post-translational-modifications.
Dictionary that maps sequence elements (e.g. “cm”) from xiSearch and xiFDR to their corresponding post-translational-modifications (e.g. (“Carbamidomethyl”, 57.021464)).
Examples
>>> from pyXLMS.constants import XI_MODIFICATION_MAPPING
>>> XI_MODIFICATION_MAPPING["cm"]
('Carbamidomethyl', 57.021464)
>>> XI_MODIFICATION_MAPPING["ox"]
('Oxidation', 15.994915)
pyXLMS.data module
- pyXLMS.data.check_indexing(value: int | List[int]) → bool
Checks that the given value is not 0-based.
- Parameters:
value (int, or list of int) – The value(s) to check.
- Returns:
If the given value(s) is/are okay.
- Return type:
bool
- Raises:
ValueError – If any of the values are smaller than one.
Examples
>>> from pyXLMS.data import check_indexing
>>> check_indexing([1, 2, 3])
True
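The documented behaviour (accept a single value or a list, raise ValueError for anything smaller than one) can be sketched as follows. This is an illustrative re-implementation under those stated assumptions, not the pyXLMS source; the name `check_indexing_sketch` and the error message are made up for this example.

```python
from typing import List, Union


def check_indexing_sketch(value: Union[int, List[int]]) -> bool:
    """Illustrative stand-in for check_indexing, not the library code."""
    # Treat a single integer like a one-element list.
    values = value if isinstance(value, list) else [value]
    for position in values:
        if position < 1:
            raise ValueError(
                f"Position {position} is smaller than one, positions are 1-based."
            )
    return True


ok = check_indexing_sketch([1, 2, 3])

# A 0-based position is rejected.
try:
    check_indexing_sketch([0, 1])
    rejected = False
except ValueError:
    rejected = True
```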
- pyXLMS.data.check_input(parameter: Any, parameter_name: str, supported_class: Any, supported_subclass: Any | None = None)
Checks if the given parameter is of the specified type.
Function that checks if a given parameter is of the specified type and if iterable, all elements are of the specified element type. This is mostly an input check function to catch any errors arising from not supported inputs early.
- Parameters:
parameter (any) – Parameter to check class of.
parameter_name (str) – Name of the parameter.
supported_class (any) – Class the parameter has to be of.
supported_subclass (any, or None, default = None) – Class of the values in case the parameter is a list or dict.
- Returns:
If the given input is okay.
- Return type:
bool
- Raises:
TypeError – If the parameter is not of the given class.
Examples
>>> from pyXLMS.data import check_input
>>> check_input("PEPTIDE", "peptide_a", str)
True

>>> from pyXLMS.data import check_input
>>> check_input([1, 2], "xl_position_proteins_a", list, int)
True
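To make the role of `supported_subclass` more concrete, here is a minimal sketch of a check with the same signature. It is an assumption-based illustration, not the pyXLMS implementation: in particular, raising TypeError for a wrongly typed element and the wording of the messages are guesses, and the name `check_input_sketch` is hypothetical.

```python
from typing import Any, Optional


def check_input_sketch(
    parameter: Any,
    parameter_name: str,
    supported_class: Any,
    supported_subclass: Optional[Any] = None,
) -> bool:
    """Illustrative stand-in for check_input, not the library code."""
    if not isinstance(parameter, supported_class):
        raise TypeError(
            f"{parameter_name} must be of type {supported_class.__name__}."
        )
    if supported_subclass is not None:
        # For dicts the values are checked, for lists the elements.
        if isinstance(parameter, dict):
            elements = list(parameter.values())
        elif isinstance(parameter, list):
            elements = parameter
        else:
            elements = []
        for element in elements:
            if not isinstance(element, supported_subclass):
                raise TypeError(
                    f"Elements of {parameter_name} must be of type "
                    f"{supported_subclass.__name__}."
                )
    return True


plain_ok = check_input_sketch("PEPTIDE", "peptide_a", str)
list_ok = check_input_sketch([1, 2], "xl_position_proteins_a", list, int)

# A list with a wrongly typed element is rejected in this sketch.
try:
    check_input_sketch([1, "2"], "xl_position_proteins_a", list, int)
    mixed_rejected = False
except TypeError:
    mixed_rejected = True
```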
- pyXLMS.data.check_input_multi(parameter: Any, parameter_name: str, supported_classes: List[Any], supported_subclass: Any | None = None)
Checks if the given parameter is of one of the specified types.
Function that checks if a given parameter is of one of the specified types and if iterable, all elements are of the specified element type. This is mostly an input check function to catch any errors arising from not supported inputs early.
- Parameters:
parameter (any) – Parameter to check class of.
parameter_name (str) – Name of the parameter.
supported_classes (list of any) – Classes the parameter has to be of.
supported_subclass (any, or None, default = None) – Class of the values in case the parameter is a list or dict.
- Returns:
If the given input is okay.
- Return type:
bool
- Raises:
TypeError – If the parameter is not of one of the given classes.
Examples
>>> from pyXLMS.data import check_input_multi
>>> check_input_multi("PEPTIDE", "peptide_a", [str, list])
True
- pyXLMS.data.create_crosslink(
    peptide_a: str,
    xl_position_peptide_a: int,
    proteins_a: List[str] | None,
    xl_position_proteins_a: List[int] | None,
    decoy_a: bool | None,
    peptide_b: str,
    xl_position_peptide_b: int,
    proteins_b: List[str] | None,
    xl_position_proteins_b: List[int] | None,
    decoy_b: bool | None,
    score: float | None,
    additional_information: Dict[str, Any] | None = None,
  )
Creates a crosslink data structure.
Contains minimal data necessary for representing a single crosslink. The returned crosslink data structure is a dictionary with keys as detailed in the return section.
- Parameters:
peptide_a (str) – The unmodified amino acid sequence of the first peptide.
xl_position_peptide_a (int) – The position of the crosslinker in the sequence of the first peptide (1-based).
proteins_a (list of str, or None) – The accessions of proteins that the first peptide is associated with.
xl_position_proteins_a (list of int, or None) – Positions of the crosslink in the proteins of the first peptide (1-based).
decoy_a (bool, or None) – Whether the alpha peptide is from the decoy database or not.
peptide_b (str) – The unmodified amino acid sequence of the second peptide.
xl_position_peptide_b (int) – The position of the crosslinker in the sequence of the second peptide (1-based).
proteins_b (list of str, or None) – The accessions of proteins that the second peptide is associated with.
xl_position_proteins_b (list of int, or None) – Positions of the crosslink in the proteins of the second peptide (1-based).
decoy_b (bool, or None) – Whether the beta peptide is from the decoy database or not.
score (float, or None) – Score of the crosslink.
additional_information (dict with str keys, or None, default = None) – A dictionary with additional information associated with the crosslink.
- Returns:
The dictionary representing the crosslink with keys data_type, completeness, alpha_peptide, alpha_peptide_crosslink_position, alpha_proteins, alpha_proteins_crosslink_positions, alpha_decoy, beta_peptide, beta_peptide_crosslink_position, beta_proteins, beta_proteins_crosslink_positions, beta_decoy, crosslink_type, score, and additional_information. Alpha and beta are assigned based on peptide sequence: the peptide that comes first alphabetically is assigned to alpha.
- Return type:
dict
- Raises:
TypeError – If the parameter is not of the given class.
ValueError – If the length of crosslink positions is not equal to the length of proteins.
Notes
The minimum required data for creating a crosslink is:
peptide_a: The unmodified amino acid sequence of the first peptide.
peptide_b: The unmodified amino acid sequence of the second peptide.
xl_position_peptide_a: The position of the crosslinker in the sequence of the first peptide (1-based).
xl_position_peptide_b: The position of the crosslinker in the sequence of the second peptide (1-based).
Examples
>>> from pyXLMS.data import create_crosslink
>>> minimal_crosslink = create_crosslink("PEPTIDEA", 1, None, None, None, "PEPTIDEB", 5, None, None, None, None)
>>> crosslink = create_crosslink("PEPTIDEA", 1, ["PROTEINA"], [1], False, "PEPTIDEB", 5, ["PROTEINB"], [3], False, 34.5)
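The alpha/beta rule described in the return section can be illustrated with a short sketch. It only covers the peptide and position keys and assumes that "alphabetically" means plain Python string comparison and that identical sequences keep their input order; neither detail is confirmed by this documentation. The helper `assign_alpha_beta` is hypothetical and not part of pyXLMS.

```python
from typing import Any, Dict


def assign_alpha_beta(
    peptide_a: str,
    xl_position_peptide_a: int,
    peptide_b: str,
    xl_position_peptide_b: int,
) -> Dict[str, Any]:
    """Hypothetical helper mirroring the documented alpha/beta assignment."""
    first = (peptide_a, xl_position_peptide_a)
    second = (peptide_b, xl_position_peptide_b)
    # Assumption: ties (identical sequences) keep the input order.
    if peptide_a <= peptide_b:
        alpha, beta = first, second
    else:
        alpha, beta = second, first
    return {
        "alpha_peptide": alpha[0],
        "alpha_peptide_crosslink_position": alpha[1],
        "beta_peptide": beta[0],
        "beta_peptide_crosslink_position": beta[1],
    }


# The inputs are given in "wrong" order, so they are swapped.
xl = assign_alpha_beta("PEPTIDEB", 5, "PEPTIDEA", 1)
```

The crosslink position travels with its peptide, so swapping the peptides also swaps the positions.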
- pyXLMS.data.create_crosslink_from_csm(csm: Dict[str, Any])
Creates a crosslink data structure from a crosslink-spectrum-match.
Creates a crosslink data structure from a crosslink-spectrum-match. The returned crosslink data structure is a dictionary with keys as detailed in the return section.
- Parameters:
csm (dict of str) – The crosslink-spectrum-match item to be converted to a crosslink item.
- Returns:
The dictionary representing the crosslink with keys data_type, completeness, alpha_peptide, alpha_peptide_crosslink_position, alpha_proteins, alpha_proteins_crosslink_positions, alpha_decoy, beta_peptide, beta_peptide_crosslink_position, beta_proteins, beta_proteins_crosslink_positions, beta_decoy, crosslink_type, score, and additional_information. Alpha and beta are assigned based on peptide sequence: the peptide that comes first alphabetically is assigned to alpha.
- Return type:
dict
- Raises:
TypeError – If parameter csm is not a valid crosslink-spectrum-match.
Notes
See also data.create_crosslink().
Examples
>>> from pyXLMS.data import create_csm_min, create_crosslink_from_csm
>>> csm = create_csm_min("PEPTIDEA", 1, "PEPTIDEB", 5, "RUN_1", 1)
>>> crosslink = create_crosslink_from_csm(csm)
- pyXLMS.data.create_crosslink_min(
    peptide_a: str,
    xl_position_peptide_a: int,
    peptide_b: str,
    xl_position_peptide_b: int,
    **kwargs,
  )
Creates a crosslink data structure from minimal input.
Contains minimal data necessary for representing a single crosslink. This is an alias for data.create_crosslink() that sets all optional parameters to None for convenience. The returned crosslink data structure is a dictionary with keys as detailed in the return section.
- Parameters:
peptide_a (str) – The unmodified amino acid sequence of the first peptide.
xl_position_peptide_a (int) – The position of the crosslinker in the sequence of the first peptide (1-based).
peptide_b (str) – The unmodified amino acid sequence of the second peptide.
xl_position_peptide_b (int) – The position of the crosslinker in the sequence of the second peptide (1-based).
**kwargs – Any additional parameters will be passed to data.create_crosslink().
- Returns:
The dictionary representing the crosslink with keys data_type, completeness, alpha_peptide, alpha_peptide_crosslink_position, alpha_proteins, alpha_proteins_crosslink_positions, alpha_decoy, beta_peptide, beta_peptide_crosslink_position, beta_proteins, beta_proteins_crosslink_positions, beta_decoy, crosslink_type, score, and additional_information. Alpha and beta are assigned based on peptide sequence: the peptide that comes first alphabetically is assigned to alpha.
- Return type:
dict
Notes
See also data.create_crosslink().
Examples
>>> from pyXLMS.data import create_crosslink_min
>>> minimal_crosslink = create_crosslink_min("PEPTIDEA", 1, "PEPTIDEB", 5)
- pyXLMS.data.create_csm(
    peptide_a: str,
    modifications_a: Dict[int, Tuple[str, float]] | None,
    xl_position_peptide_a: int,
    proteins_a: List[str] | None,
    xl_position_proteins_a: List[int] | None,
    pep_position_proteins_a: List[int] | None,
    score_a: float | None,
    decoy_a: bool | None,
    peptide_b: str,
    modifications_b: Dict[int, Tuple[str, float]] | None,
    xl_position_peptide_b: int,
    proteins_b: List[str] | None,
    xl_position_proteins_b: List[int] | None,
    pep_position_proteins_b: List[int] | None,
    score_b: float | None,
    decoy_b: bool | None,
    score: float | None,
    spectrum_file: str,
    scan_nr: int,
    charge: int | None,
    rt: float | None,
    im_cv: float | None,
    additional_information: Dict[str, Any] | None = None,
  )
Creates a crosslink-spectrum-match data structure.
Contains minimal data necessary for representing a single crosslink-spectrum-match. The returned crosslink-spectrum-match data structure is a dictionary with keys as detailed in the return section.
- Parameters:
peptide_a (str) – The unmodified amino acid sequence of the first peptide.
modifications_a (dict of [int, tuple], or None) – The modifications of the first peptide given as a dictionary that maps peptide position (1-based) to modification given as a tuple of modification name and modification delta mass. N-terminal modifications should be denoted with position 0. C-terminal modifications should be denoted with position len(peptide) + 1. If the peptide is not modified an empty dictionary should be given.
xl_position_peptide_a (int) – The position of the crosslinker in the sequence of the first peptide (1-based).
proteins_a (list of str, or None) – The accessions of proteins that the first peptide is associated with.
xl_position_proteins_a (list of int, or None) – Positions of the crosslink in the proteins of the first peptide (1-based).
pep_position_proteins_a (list of int, or None) – Positions of the first peptide in the corresponding proteins (1-based).
score_a (float, or None) – Identification score of the first peptide.
decoy_a (bool, or None) – Whether the alpha peptide is from the decoy database or not.
peptide_b (str) – The unmodified amino acid sequence of the second peptide.
modifications_b (dict of [int, tuple], or None) – The modifications of the second peptide given as a dictionary that maps peptide position (1-based) to modification given as a tuple of modification name and modification delta mass. N-terminal modifications should be denoted with position 0. C-terminal modifications should be denoted with position len(peptide) + 1. If the peptide is not modified an empty dictionary should be given.
xl_position_peptide_b (int) – The position of the crosslinker in the sequence of the second peptide (1-based).
proteins_b (list of str, or None) – The accessions of proteins that the second peptide is associated with.
xl_position_proteins_b (list of int, or None) – Positions of the crosslink in the proteins of the second peptide (1-based).
pep_position_proteins_b (list of int, or None) – Positions of the second peptide in the corresponding proteins (1-based).
score_b (float, or None) – Identification score of the second peptide.
decoy_b (bool, or None) – Whether the beta peptide is from the decoy database or not.
score (float, or None) – Score of the crosslink-spectrum-match.
spectrum_file (str) – Name of the spectrum file the crosslink-spectrum-match was identified in.
scan_nr (int) – The corresponding scan number of the crosslink-spectrum-match.
charge (int, or None) – The precursor charge of the corresponding mass spectrum of the crosslink-spectrum-match.
rt (float, or None) – The retention time of the corresponding mass spectrum of the crosslink-spectrum-match in seconds.
im_cv (float, or None) – The ion mobility or compensation voltage of the corresponding mass spectrum of the crosslink-spectrum-match.
additional_information (dict with str keys, or None, default = None) – A dictionary with additional information associated with the crosslink-spectrum-match.
- Returns:
The dictionary representing the crosslink-spectrum-match with keys data_type, completeness, alpha_peptide, alpha_modifications, alpha_peptide_crosslink_position, alpha_proteins, alpha_proteins_crosslink_positions, alpha_proteins_peptide_positions, alpha_score, alpha_decoy, beta_peptide, beta_modifications, beta_peptide_crosslink_position, beta_proteins, beta_proteins_crosslink_positions, beta_proteins_peptide_positions, beta_score, beta_decoy, crosslink_type, score, spectrum_file, scan_nr, retention_time, ion_mobility, and additional_information. Alpha and beta are assigned based on peptide sequence: the peptide that comes first alphabetically is assigned to alpha.
- Return type:
dict
- Raises:
TypeError – If the parameter is not of the given class.
ValueError – If the length of crosslink positions or peptide positions is not equal to the length of proteins.
Notes
The minimum required data for creating a crosslink-spectrum-match is:
peptide_a: The unmodified amino acid sequence of the first peptide.
peptide_b: The unmodified amino acid sequence of the second peptide.
xl_position_peptide_a: The position of the crosslinker in the sequence of the first peptide (1-based).
xl_position_peptide_b: The position of the crosslinker in the sequence of the second peptide (1-based).
spectrum_file: Name of the spectrum file the crosslink-spectrum-match was identified in.
scan_nr: The corresponding scan number of the crosslink-spectrum-match.
Examples
>>> from pyXLMS.data import create_csm
>>> minimal_csm = create_csm("PEPTIDEA", {}, 1, None, None, None, None, None, "PEPTIDEB", {}, 5, None, None, None, None, None, None, "MS_EXP1", 1, None, None, None)
>>> csm = create_csm("PEPTIDEA", {1: ("Oxidation", 15.994915)}, 1, ["PROTEINA"], [1], [1], 20.1, False, "PEPTIDEB", {}, 5, ["PROTEINB"], [3], [1], 33.7, False, 20.1, "MS_EXP1", 1, 3, 13.5, -50)
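The position convention for the modification dictionaries (0 for the N-terminus, 1 to len(peptide) for residues, len(peptide) + 1 for the C-terminus) is easy to get wrong, so a small sketch may help. The helper `describe_modifications` and its output format are hypothetical and only illustrate how the positions would be interpreted under the convention described above.

```python
from typing import Dict, List, Tuple

# Same shape as the modifications_a / modifications_b parameters.
Modifications = Dict[int, Tuple[str, float]]


def describe_modifications(peptide: str, modifications: Modifications) -> List[str]:
    """Hypothetical helper: label each modification with its site."""
    described = []
    for position in sorted(modifications):
        name, delta_mass = modifications[position]
        if position == 0:
            site = "N-term"
        elif position == len(peptide) + 1:
            site = "C-term"
        elif 1 <= position <= len(peptide):
            # Positions are 1-based, Python strings are 0-based.
            site = f"{peptide[position - 1]}{position}"
        else:
            raise ValueError(f"Position {position} is outside of the peptide.")
        described.append(f"{site}: {name} ({delta_mass:+.6f})")
    return described


mods = {0: ("Acetyl", 42.010565), 1: ("Oxidation", 15.994915)}
labels = describe_modifications("MPEPTIDE", mods)
```

Here position 0 refers to the N-terminus and position 1 to the first residue (M), so the two entries do not collide.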
- pyXLMS.data.create_csm_min(
    peptide_a: str,
    xl_position_peptide_a: int,
    peptide_b: str,
    xl_position_peptide_b: int,
    spectrum_file: str,
    scan_nr: int,
    **kwargs,
  )
Creates a crosslink-spectrum-match data structure from minimal input.
Contains minimal data necessary for representing a single crosslink-spectrum-match. This is an alias for data.create_csm() that sets all optional parameters to None for convenience. The returned crosslink-spectrum-match data structure is a dictionary with keys as detailed in the return section.
- Parameters:
peptide_a (str) – The unmodified amino acid sequence of the first peptide.
xl_position_peptide_a (int) – The position of the crosslinker in the sequence of the first peptide (1-based).
peptide_b (str) – The unmodified amino acid sequence of the second peptide.
xl_position_peptide_b (int) – The position of the crosslinker in the sequence of the second peptide (1-based).
spectrum_file (str) – Name of the spectrum file the crosslink-spectrum-match was identified in.
scan_nr (int) – The corresponding scan number of the crosslink-spectrum-match.
**kwargs – Any additional parameters will be passed to data.create_csm().
- Returns:
The dictionary representing the crosslink-spectrum-match with keys data_type, completeness, alpha_peptide, alpha_modifications, alpha_peptide_crosslink_position, alpha_proteins, alpha_proteins_crosslink_positions, alpha_proteins_peptide_positions, alpha_score, alpha_decoy, beta_peptide, beta_modifications, beta_peptide_crosslink_position, beta_proteins, beta_proteins_crosslink_positions, beta_proteins_peptide_positions, beta_score, beta_decoy, crosslink_type, score, spectrum_file, scan_nr, retention_time, ion_mobility, and additional_information. Alpha and beta are assigned based on peptide sequence: the peptide that comes first alphabetically is assigned to alpha.
- Return type:
dict
Notes
See also data.create_csm().
Examples
>>> from pyXLMS.data import create_csm_min
>>> minimal_csm = create_csm_min("PEPTIDEA", 1, "PEPTIDEB", 5, "MS_EXP1", 1)
- pyXLMS.data.create_parser_result(search_engine: str, csms: List[Dict[str, Any]] | None, crosslinks: List[Dict[str, Any]] | None)
Creates a parser result data structure.
Contains all necessary data elements that should be contained in a result returned by a crosslink search engine result parser.
- Parameters:
search_engine (str) – Name of the identifying crosslink search engine.
csms (list of dict, or None) – List of crosslink-spectrum-matches as created by data.create_csm().
crosslinks (list of dict, or None) – List of crosslinks as created by data.create_crosslink().
- Returns:
The parser result data structure which is a dictionary with keys data_type, completeness, search_engine, crosslink-spectrum-matches and crosslinks.
- Return type:
dict
Examples
>>> from pyXLMS.data import create_parser_result
>>> result = create_parser_result("MS Annika", None, None)
>>> result["data_type"]
'parser_result'
>>> result["completeness"]
'empty'
>>> result["search_engine"]
'MS Annika'
pyXLMS.exporter module
pyXLMS.exporter_to_impxfdr module
- pyXLMS.exporter_to_impxfdr.to_impxfdr(data: List[Dict[str, Any]], filename: str | None, targets_only: bool = True)
Exports a list of crosslinks or crosslink-spectrum-matches to IMP-X-FDR format.
Exports a list of crosslinks or crosslink-spectrum-matches to IMP-X-FDR format for benchmarking purposes. The tool IMP-X-FDR is available from github.com/vbc-proteomics-org/imp-x-fdr. We recommend using version 1.1.0 and selecting “MS Annika” as the input file format for the exported file. A slightly modified version is available from github.com/hgb-bin-proteomics/MSAnnika_NC_Results. This version contains a few bug fixes and was used for the MS Annika 2.0 and MS Annika 3.0 publications. Requires that the alpha_proteins, beta_proteins, alpha_proteins_crosslink_positions and beta_proteins_crosslink_positions fields are set for crosslinks and crosslink-spectrum-matches.
- Parameters:
data (list of dict of str, any) – A list of crosslinks or crosslink-spectrum-matches.
filename (str, or None, default = None) – If not None, the exported data will be written to a file with the specified filename. The filename should end in “.xlsx” as the file is exported to Microsoft Excel file format.
targets_only (bool, default = True) – Whether or not only target crosslinks or crosslink-spectrum-matches should be exported. For benchmarking purposes this is usually the case. If the crosslinks or crosslink-spectrum-matches do not contain target-decoy labels this should be set to False.
- Returns:
A pandas DataFrame containing crosslinks or crosslink-spectrum-matches in IMP-X-FDR format.
- Return type:
pd.DataFrame
- Raises:
TypeError – If a wrong data type is provided.
TypeError – If data contains elements of mixed data type.
ValueError – If the provided data contains no elements or if none of the data has target-decoy labels and parameter ‘targets_only’ is set to True.
RuntimeError – If not all of the required information is present in the input data.
Examples
>>> from pyXLMS.exporter import to_impxfdr
>>> from pyXLMS.parser import read
>>> pr = read("data/xi/1perc_xl_boost_Links_xiFDR2.2.1.csv", engine="xiSearch/xiFDR", crosslinker="DSS")
>>> crosslinks = pr["crosslinks"]
>>> to_impxfdr(crosslinks, filename="crosslinks.xlsx")
     Crosslink Type             Sequence A  Position A  Accession A  In protein A  ...  Position B  Accession B  In protein B  Best CSM Score  Decoy
  0           Intra          VVDELV[K]VMGR           7         Cas9           753  ...           7         Cas9           753          40.679  False
  1           Intra  MLASAGELQ[K]GNELALPSK          10         Cas9           753  ...           7         Cas9          1226          40.231  False
  2           Intra        MDGTEELLV[K]LNR          10         Cas9           396  ...          10         Cas9           396          39.582  False
  3           Intra         MTNFD[K]NLPNEK           6         Cas9           965  ...           2         Cas9           504          35.880  False
  4           Intra             DFQFY[K]VR           6         Cas9           978  ...           4         Cas9          1028          35.281  False
 ..             ...                    ...         ...          ...           ...  ...         ...          ...           ...             ...    ...
220           Intra        LP[K]YSLFELENGR           3         Cas9           866  ...           3         Cas9          1204           9.877  False
221           Intra               D[K]QSGK           2         Cas9           677  ...           2         Cas9           677           9.702  False
222           Intra               AGFI[K]R           5         Cas9           922  ...          11         Cas9           881           9.666  False
223           Intra                E[K]IEK           2         Cas9           443  ...           1         Cas9           562           9.656  False
224           Intra                LS[K]SR           3         Cas9           222  ...           3         Cas9           222           9.619  False
[225 rows x 11 columns]
>>> from pyXLMS.exporter import to_impxfdr
>>> from pyXLMS.parser import read
>>> pr = read("data/xi/1perc_xl_boost_CSM_xiFDR2.2.1.csv", engine="xiSearch/xiFDR", crosslinker="DSS")
>>> csms = pr["crosslink-spectrum-matches"]
>>> to_impxfdr(csms, filename="csms.xlsx")
     Crosslink Type          Sequence A  Position A  Accession A  In protein A  ...  Position B  Accession B  In protein B  Best CSM Score  Decoy
  0           Intra  [K]IECFDSVEISGVEDR           1         Cas9           575  ...           1         Cas9           575          27.268  False
  1           Intra       LVDSTD[K]ADLR           7         Cas9           152  ...          11         Cas9           881          26.437  False
  2           Intra     GGLSELD[K]AGFIK           8         Cas9           917  ...           8         Cas9           917          26.134  False
  3           Intra       LVDSTD[K]ADLR           7         Cas9           152  ...           7         Cas9           152          25.804  False
  4           Intra       VVDELV[K]VMGR           7         Cas9           753  ...           7         Cas9           753          24.861  False
 ..             ...                 ...         ...          ...           ...  ...         ...          ...           ...             ...    ...
406           Intra          [K]GILQTVK           1         Cas9           739  ...           3         Cas9           222           6.977  False
407           Intra          QQLPE[K]YK           6         Cas9           350  ...           6         Cas9           350           6.919  False
408           Intra           ESILP[K]R           6         Cas9          1117  ...           7         Cas9          1035           6.853  False
409           Intra             LS[K]SR           3         Cas9           222  ...           2         Cas9           884           6.809  False
410           Intra     QIT[K]HVAQILDSR           4         Cas9           933  ...           6         Cas9           350           6.808  False
[411 rows x 11 columns]
pyXLMS.exporter_to_msannika module
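Since the exporter requires the protein and protein-position fields to be set, it can be useful to screen items before exporting. The following is a minimal pre-check sketch operating on plain dictionaries with the documented keys. It does not call pyXLMS, the helper `missing_impxfdr_fields` is hypothetical, and treating None as "not set" is an assumption.

```python
from typing import Any, Dict, List

# Fields that the documentation above lists as required for the export.
REQUIRED_FIELDS = [
    "alpha_proteins",
    "beta_proteins",
    "alpha_proteins_crosslink_positions",
    "beta_proteins_crosslink_positions",
]


def missing_impxfdr_fields(item: Dict[str, Any]) -> List[str]:
    """Hypothetical helper: list required fields that are absent or None."""
    return [field for field in REQUIRED_FIELDS if item.get(field) is None]


# Hand-written stand-ins for a complete and a minimal crosslink.
complete = {
    "alpha_proteins": ["PROTEINA"],
    "alpha_proteins_crosslink_positions": [1],
    "beta_proteins": ["PROTEINB"],
    "beta_proteins_crosslink_positions": [3],
}
minimal = {
    "alpha_proteins": None,
    "alpha_proteins_crosslink_positions": None,
    "beta_proteins": None,
    "beta_proteins_crosslink_positions": None,
}

exportable = [xl for xl in (complete, minimal) if not missing_impxfdr_fields(xl)]
```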
- pyXLMS.exporter_to_msannika.get_msannika_crosslink_sequence(peptide: str, crosslink_position: int) → str
Returns the crosslinked peptide sequence in MS Annika format.
Returns the crosslinked peptide sequence in MS Annika format, which is the peptide amino acid sequence with the crosslinked residue in square brackets (see examples).
- Parameters:
peptide (str) – The (unmodified) amino acid sequence of the peptide.
crosslink_position (int) – Position of the crosslinker in the peptide sequence (1-based).
- Returns:
The crosslinked peptide sequence in MS Annika format.
- Return type:
str
- Raises:
ValueError – If the crosslink position is outside the peptide’s length.
Examples
>>> from pyXLMS.exporter import get_msannika_crosslink_sequence
>>> get_msannika_crosslink_sequence("PEPKTIDE", 4)
'PEP[K]TIDE'

>>> from pyXLMS.exporter import get_msannika_crosslink_sequence
>>> get_msannika_crosslink_sequence("KPEPTIDE", 1)
'[K]PEPTIDE'

>>> from pyXLMS.exporter import get_msannika_crosslink_sequence
>>> get_msannika_crosslink_sequence("PEPTIDEK", 8)
'PEPTIDE[K]'
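The bracket notation shown in the examples can be reproduced with simple string slicing. The sketch below is an illustrative re-implementation based only on the documented behaviour; the name `msannika_sequence_sketch` and the error message are assumptions, and the real function may differ in details.

```python
def msannika_sequence_sketch(peptide: str, crosslink_position: int) -> str:
    """Illustrative stand-in: wrap the crosslinked residue in square brackets."""
    if crosslink_position < 1 or crosslink_position > len(peptide):
        raise ValueError(
            f"Crosslink position {crosslink_position} is outside of the peptide."
        )
    # Convert the 1-based position to a 0-based string index.
    index = crosslink_position - 1
    return f"{peptide[:index]}[{peptide[index]}]{peptide[index + 1:]}"


middle = msannika_sequence_sketch("PEPKTIDE", 4)
first = msannika_sequence_sketch("KPEPTIDE", 1)
last = msannika_sequence_sketch("PEPTIDEK", 8)
```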
- pyXLMS.exporter_to_msannika.to_msannika(data: List[Dict[str, Any]], filename: str | None = None, format: Literal['csv', 'tsv', 'xlsx'] = 'csv')
Exports a list of crosslinks or crosslink-spectrum-matches to MS Annika format.
Exports a list of crosslinks or crosslink-spectrum-matches to MS Annika format. This might be useful for tools that support MS Annika input but are not supported by pyXLMS (yet).
- Parameters:
data (list of dict of str, any) – A list of crosslinks or crosslink-spectrum-matches.
filename (str, or None, default = None) – If not None, the exported data will be written to a file with the specified filename.
format (str, one of "csv", "tsv", or "xlsx", default = "csv") – File format of the exported file if filename is not None.
- Returns:
A pandas DataFrame containing crosslinks or crosslink-spectrum-matches in MS Annika format.
- Return type:
pd.DataFrame
- Raises:
TypeError – If a wrong data type is provided.
TypeError – If data contains elements of mixed data type.
TypeError – If parameter format is not one of ‘csv’, ‘tsv’ or ‘xlsx’.
ValueError – If the provided data contains no elements.
Warning
The MS Annika exporter will not check if all necessary information is available for the exported crosslinks or crosslink-spectrum-matches. If a value is not available it will be denoted as a missing value in the dataframe and exported file. Please make sure all necessary information is available before using the exported file with another tool! Please also note that modifications are not exported. For down-stream analysis of modifications please refer to transform.to_proforma() or transform.to_dataframe()!
Examples
>>> from pyXLMS.exporter import to_msannika
>>> from pyXLMS.data import create_crosslink_min
>>> xl1 = create_crosslink_min("KPEPTIDE", 1, "PKEPTIDE", 2)
>>> xl2 = create_crosslink_min("PEKPTIDE", 3, "PEPKTIDE", 4)
>>> crosslinks = [xl1, xl2]
>>> to_msannika(crosslinks)
   Crosslink Type  Sequence A  Position A  Accession A  In protein A  Sequence B  Position B  Accession B  In protein B  Best CSM Score  Decoy
0           Inter  [K]PEPTIDE           1         None          None  P[K]EPTIDE           2         None          None            None   None
1           Inter  PE[K]PTIDE           3         None          None  PEP[K]TIDE           4         None          None            None   None

>>> from pyXLMS.exporter import to_msannika
>>> from pyXLMS.data import create_crosslink_min
>>> xl1 = create_crosslink_min("KPEPTIDE", 1, "PKEPTIDE", 2)
>>> xl2 = create_crosslink_min("PEKPTIDE", 3, "PEPKTIDE", 4)
>>> crosslinks = [xl1, xl2]
>>> df = to_msannika(crosslinks, filename = "crosslinks.csv", format = "csv")

>>> from pyXLMS.exporter import to_msannika
>>> from pyXLMS.data import create_csm_min
>>> csm1 = create_csm_min("KPEPTIDE", 1, "PKEPTIDE", 2, "RUN_1", 1)
>>> csm2 = create_csm_min("PEKPTIDE", 3, "PEPKTIDE", 4, "RUN_1", 2)
>>> csms = [csm1, csm2]
>>> to_msannika(csms)
            Sequence  Crosslink Type  Sequence A  Crosslinker Position A  ...  First Scan  Charge  RT [min]  Compensation Voltage
0  KPEPTIDE-PKEPTIDE           Inter    KPEPTIDE                       1  ...           1    None      None                  None
1  PEKPTIDE-PEPKTIDE           Inter    PEKPTIDE                       3  ...           2    None      None                  None
[2 rows x 20 columns]

>>> from pyXLMS.exporter import to_msannika
>>> from pyXLMS.data import create_csm_min
>>> csm1 = create_csm_min("KPEPTIDE", 1, "PKEPTIDE", 2, "RUN_1", 1)
>>> csm2 = create_csm_min("PEKPTIDE", 3, "PEPKTIDE", 4, "RUN_1", 2)
>>> csms = [csm1, csm2]
>>> df = to_msannika(csms, filename = "csms.csv", format = "csv")
pyXLMS.exporter_to_pyxlinkviewer module
- pyXLMS.exporter_to_pyxlinkviewer.to_pyxlinkviewer(
    crosslinks: List[Dict[str, Any]],
    pdb_file: str | BinaryIO,
    gap_open: int | float = -10.0,
    gap_extension: int | float = -1.0,
    min_sequence_identity: float = 0.8,
    allow_site_mismatch: bool = False,
    ignore_chains: List[str] = [],
    filename_prefix: str | None = None,
  )
Exports a list of crosslinks to PyXlinkViewer format.
Exports a list of crosslinks to PyXlinkViewer format for visualization in PyMOL. The tool PyXlinkViewer is available from github.com/BobSchiffrin/PyXlinkViewer. This exporter performs basic local sequence alignment to align crosslinked peptides to a protein structure in PDB format. Gap open and gap extension penalties can be chosen, as well as a threshold for sequence identity that must be satisfied for a match to be reported. Additionally, the alignment is checked to confirm that the supposedly crosslinked residue in the protein structure can be modified with a crosslinker. Because the alignment may shift amino acids, a crosslink can end up being reported at a position that is not able to react with the crosslinker. Optionally, these positions can still be reported.
- Parameters:
crosslinks (list of dict of str, any) – A list of crosslinks.
pdb_file (str, or file stream) – The name/path of the PDB file or a file-like object/stream. If a string is provided but no file is found locally, it’s assumed to be an identifier and the file is fetched from the PDB.
gap_open (int, or float, default = -10.0) – Gap open penalty for sequence alignment.
gap_extension (int, or float, default = -1.0) – Gap extension penalty for sequence alignment.
min_sequence_identity (float, default = 0.8) – Minimum sequence identity to consider an aligned crosslinked peptide a match with its corresponding position in the protein structure. Should be given as a fraction between 0 and 1, e.g. the default of 0.8 corresponds to a minimum of 80% sequence identity.
allow_site_mismatch (bool, default = False) – Whether the position should still be reported if the crosslink position after alignment is not a reactive amino acid in the protein structure. By default such cases are not reported.
ignore_chains (list of str, default = empty list) – A list of chains to ignore in the protein structure.
filename_prefix (str, or None, default = None) – If not None, the exported data will be written to files with the specified filename prefix. The full list of written files can be accessed via the returned dictionary.
- Returns:
Returns a dictionary with the following keys:
PyXlinkViewer: The formatted text for PyXlinkViewer.
PyXlinkViewer DataFrame: The information from PyXlinkViewer, but as a pandas DataFrame.
Number of mapped crosslinks: The total number of mapped crosslinks.
Mapping: A string that logs how crosslinks were mapped to the protein structure.
Parsed PDB sequence: The protein sequence that was parsed from the PDB file.
Parsed PDB chains: The parsed chains from the PDB file.
Parsed PDB residue numbers: The parsed residue numbers from the PDB file.
Exported files: A list of filenames of all files that were written to disk.
- Return type:
dict of str, any
- Raises:
TypeError – If a wrong data type is provided.
TypeError – If data contains elements of mixed data type.
ValueError – If parameter min_sequence_identity is out of bounds.
ValueError – If the provided data contains no elements.
Examples
>>> from pyXLMS.exporter import to_pyxlinkviewer >>> from pyXLMS.parser import read_custom >>> pr = read_custom("data/_test/exporter/pyxlinkviewer/unique_links_all_pyxlms.csv") >>> crosslinks = pr["crosslinks"] >>> pyxlinkviewer_result = to_pyxlinkviewer(crosslinks, pdb_file="6YHU", filename_prefix="6YHU") >>> pyxlinkviewer_output_file_str = pyxlinkviewer_result["PyXlinkViewer"] >>> pyxlinkviewer_dataframe = pyxlinkviewer_result["PyXlinkViewer DataFrame"] >>> nr_mapped_crosslinks = pyxlinkviewer_result["Number of mapped crosslinks"] >>> crosslink_mapping = pyxlinkviewer_result["Mapping"] >>> parsed_pdb_sequence = pyxlinkviewer_result["Parsed PDB sequence"] >>> parsed_pdb_chains = pyxlinkviewer_result["Parsed PDB chains"] >>> parsed_pdb_residue_numbers = pyxlinkviewer_result["Parsed PDB residue numbers"] >>> exported_files = pyxlinkviewer_result["Exported files"]
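To illustrate how min_sequence_identity and allow_site_mismatch interact, the following is a minimal sketch, not the pyXLMS implementation. It assumes that sequence identity is the fraction of identical columns in an already aligned peptide/structure pair and that lysine is the only reactive residue. The names sequence_identity, accept_mapping and reactive_residues are hypothetical.

```python
def sequence_identity(aligned_peptide: str, aligned_structure: str) -> float:
    """Fraction of aligned columns with identical residues ('-' marks a gap)."""
    if len(aligned_peptide) != len(aligned_structure):
        raise ValueError("Aligned sequences must have the same length.")
    if not aligned_peptide:
        raise ValueError("Aligned sequences must not be empty.")
    matches = sum(
        1 for a, b in zip(aligned_peptide, aligned_structure) if a == b and a != "-"
    )
    return matches / len(aligned_peptide)


def accept_mapping(
    aligned_peptide: str,
    aligned_structure: str,
    crosslink_index: int,  # 0-based column of the crosslinked residue
    reactive_residues: frozenset = frozenset("K"),  # assumption: lysine only
    min_sequence_identity: float = 0.8,
    allow_site_mismatch: bool = False,
) -> bool:
    """Decide whether a mapped crosslink position would be reported."""
    if not 0.0 <= min_sequence_identity <= 1.0:
        raise ValueError("min_sequence_identity must be between 0 and 1.")
    if sequence_identity(aligned_peptide, aligned_structure) < min_sequence_identity:
        return False
    # The residue in the structure at the crosslink site must be reactive,
    # unless site mismatches are explicitly allowed.
    site_ok = aligned_structure[crosslink_index] in reactive_residues
    return site_ok or allow_site_mismatch


# 7 of 8 residues match (0.875 >= 0.8), but the site is R instead of K.
print(accept_mapping("PEPKTIDE", "PEPRTIDE", 3))
print(accept_mapping("PEPKTIDE", "PEPRTIDE", 3, allow_site_mismatch=True))
```

In this sketch the identity threshold is checked first, so allow_site_mismatch only affects peptides that already align well enough.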
pyXLMS.exporter_to_xifdr module#
- pyXLMS.exporter_to_xifdr.to_xifdr(
- csms: List[Dict[str, Any]],
- filename: str | None,
Exports a list of crosslink-spectrum-matches to xiFDR format.
Exports a list of crosslink-spectrum-matches to xiFDR format. The tool xiFDR is accessible via the link rappsilberlab.org/software/xifdr. Requires that the alpha_proteins, beta_proteins, alpha_proteins_peptide_positions, beta_proteins_peptide_positions, alpha_decoy, beta_decoy, charge and score fields are set for all crosslink-spectrum-matches.
- Parameters:
csms (list of dict of str, any) – A list of crosslink-spectrum-matches.
filename (str, or None) – If not None, the exported data will be written to a file with the specified filename.
- Returns:
A pandas DataFrame containing crosslink-spectrum-matches in xiFDR format.
- Return type:
pd.DataFrame
- Raises:
TypeError – If a wrong data type is provided.
TypeError – If ‘csms’ parameter contains elements of mixed data type.
ValueError – If the provided ‘csms’ parameter contains no elements.
RuntimeError – If not all of the required information is present in the input data.
Examples
>>> from pyXLMS.exporter import to_xifdr >>> from pyXLMS.parser import read >>> pr = read("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_CSMs.xlsx", engine="MS Annika", crosslinker="DSS") >>> csms = pr["crosslink-spectrum-matches"] >>> to_xifdr(csms, filename="msannika_xiFDR.csv") run scan peptide1 ... peptide position 1 peptide position 2 score 0 XLpeplib_Beveridge_QEx-HFX_DSS_R1.raw 2257 GQKNSR ... 777 777 119.83 1 XLpeplib_Beveridge_QEx-HFX_DSS_R1.raw 2448 GQKNSR ... 777 693 13.91 2 XLpeplib_Beveridge_QEx-HFX_DSS_R1.raw 2561 SDKNR ... 864 864 114.43 3 XLpeplib_Beveridge_QEx-HFX_DSS_R1.raw 2719 DKQSGK ... 676 676 200.98 4 XLpeplib_Beveridge_QEx-HFX_DSS_R1.raw 2792 DKQSGK ... 676 45 94.47 .. ... ... ... ... ... ... ... 821 XLpeplib_Beveridge_QEx-HFX_DSS_R1.raw 23297 MDGTEELLVKLNR ... 387 387 286.05 822 XLpeplib_Beveridge_QEx-HFX_DSS_R1.raw 23454 KIECFDSVEISGVEDR ... 575 682 376.15 823 XLpeplib_Beveridge_QEx-HFX_DSS_R1.raw 23581 SSFEKNPIDFLEAK ... 1176 1176 412.44 824 XLpeplib_Beveridge_QEx-HFX_DSS_R1.raw 23683 SSFEKNPIDFLEAK ... 1176 1176 437.10 825 XLpeplib_Beveridge_QEx-HFX_DSS_R1.raw 27087 MEDESKLHKFKDFK ... 99 1176 15.89 [826 rows x 14 columns]
>>> from pyXLMS.exporter import to_xifdr >>> from pyXLMS.parser import read >>> pr = read("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_CSMs.xlsx", engine="MS Annika", crosslinker="DSS") >>> csms = pr["crosslink-spectrum-matches"] >>> df = to_xifdr(csms, filename=None)
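Because to_xifdr raises a RuntimeError when required information is missing, it can help to check the input beforehand. The following is a minimal sketch that assumes crosslink-spectrum-matches are plain dictionaries keyed by the field names listed above and that an unset field is stored as None. The helper missing_xifdr_fields is hypothetical and not part of pyXLMS.

```python
from typing import Any, Dict, List

# Field names as listed in the to_xifdr description.
XIFDR_REQUIRED_FIELDS = (
    "alpha_proteins",
    "beta_proteins",
    "alpha_proteins_peptide_positions",
    "beta_proteins_peptide_positions",
    "alpha_decoy",
    "beta_decoy",
    "charge",
    "score",
)


def missing_xifdr_fields(csms: List[Dict[str, Any]]) -> Dict[int, List[str]]:
    """Map the index of each incomplete CSM to the fields that are unset."""
    missing: Dict[int, List[str]] = {}
    for index, csm in enumerate(csms):
        # Assumption: a missing key and a value of None both mean "not set".
        absent = [field for field in XIFDR_REQUIRED_FIELDS if csm.get(field) is None]
        if absent:
            missing[index] = absent
    return missing


complete = {field: "placeholder" for field in XIFDR_REQUIRED_FIELDS}
incomplete = dict(complete, charge=None, score=None)
print(missing_xifdr_fields([complete, incomplete]))
```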
pyXLMS.exporter_to_xinet module#
- pyXLMS.exporter_to_xinet.to_xinet(
- crosslinks: List[Dict[str, Any]],
- filename: str | None,
Exports a list of crosslinks to xiNET format.
Exports a list of crosslinks to xiNET format. The tool xiNET is accessible via the link crosslinkviewer.org. Requires that the alpha_proteins, beta_proteins, alpha_proteins_crosslink_positions and beta_proteins_crosslink_positions fields are set for all crosslinks.
- Parameters:
crosslinks (list of dict of str, any) – A list of crosslinks.
filename (str, or None) – If not None, the exported data will be written to a file with the specified filename.
- Returns:
A pandas DataFrame containing crosslinks in xiNET format.
- Return type:
pd.DataFrame
- Raises:
TypeError – If a wrong data type is provided.
TypeError – If ‘crosslinks’ parameter contains elements of mixed data type.
ValueError – If the provided ‘crosslinks’ parameter contains no elements.
RuntimeError – If not all of the required information is present in the input data.
Notes
The optional Score column in the xiNET table will only be available if all crosslinks have assigned scores.
Examples
>>> from pyXLMS.exporter import to_xinet >>> from pyXLMS.parser import read >>> from pyXLMS.transform import targets_only >>> from pyXLMS.transform import filter_proteins >>> pr = read("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_Crosslinks.xlsx", engine="MS Annika", crosslinker="DSS") >>> crosslinks = targets_only(pr)["crosslinks"] >>> cas9 = filter_proteins(crosslinks, proteins=["Cas9"])["Both"] >>> to_xinet(cas9, filename="crosslinks_xiNET.csv") Protein1 PepPos1 PepSeq1 LinkPos1 Protein2 PepPos2 PepSeq2 LinkPos2 Score Id 0 Cas9 777 GQKNSR 3 Cas9 777 GQKNSR 3 119.83 1 1 Cas9 864 SDKNR 3 Cas9 864 SDKNR 3 114.43 2 2 Cas9 676 DKQSGK 2 Cas9 676 DKQSGK 2 200.98 3 3 Cas9 676 DKQSGK 2 Cas9 45 HSIKK 4 94.47 4 4 Cas9 31 VPSKK 4 Cas9 31 VPSKK 4 110.48 5 .. ... ... ... ... ... ... ... ... ... ... 248 Cas9 387 MDGTEELLVKLNR 10 Cas9 387 MDGTEELLVKLNR 10 305.63 249 249 Cas9 682 TILDFLKSDGFANR 7 Cas9 947 YDENDKLIR 6 110.46 250 250 Cas9 788 IEEGIKELGSQILK 6 Cas9 1176 SSFEKNPIDFLEAK 5 288.36 251 251 Cas9 575 KIECFDSVEISGVEDR 1 Cas9 682 TILDFLKSDGFANR 7 376.15 252 252 Cas9 1176 SSFEKNPIDFLEAK 5 Cas9 1176 SSFEKNPIDFLEAK 5 437.10 253 [253 rows x 10 columns]
>>> from pyXLMS.exporter import to_xinet >>> from pyXLMS.parser import read >>> from pyXLMS.transform import targets_only >>> from pyXLMS.transform import filter_proteins >>> pr = read("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_Crosslinks.xlsx", engine="MS Annika", crosslinker="DSS") >>> crosslinks = targets_only(pr)["crosslinks"] >>> cas9 = filter_proteins(crosslinks, proteins=["Cas9"])["Both"] >>> df = to_xinet(cas9, filename=None)
pyXLMS.exporter_to_xiview module#
- pyXLMS.exporter_to_xiview.to_xiview(
- crosslinks: List[Dict[str, Any]],
- filename: str | None,
- minimal: bool = True,
Exports a list of crosslinks to xiVIEW format.
Exports a list of crosslinks to xiVIEW format. The tool xiVIEW is accessible via the link xiview.org/. Requires that the alpha_proteins, beta_proteins, alpha_proteins_crosslink_positions and beta_proteins_crosslink_positions fields are set for all crosslinks.
- Parameters:
crosslinks (list of dict of str, any) – A list of crosslinks.
filename (str, or None) – If not None, the exported data will be written to a file with the specified filename.
minimal (bool, default = True) – Which xiVIEW format to return. If minimal = True the minimal xiVIEW format is returned. Otherwise the “CSV without peak lists” format is returned (internally this just calls exporter.to_xinet()). For more information on the xiVIEW formats please refer to the xiVIEW specification.
- Returns:
A pandas DataFrame containing crosslinks in xiVIEW format.
- Return type:
pd.DataFrame
- Raises:
TypeError – If a wrong data type is provided.
TypeError – If ‘crosslinks’ parameter contains elements of mixed data type.
ValueError – If the provided ‘crosslinks’ parameter contains no elements.
RuntimeError – If not all of the required information is present in the input data.
Notes
The optional Score column in the xiVIEW table will only be available if all crosslinks have assigned scores; the optional Decoy* columns will only be available if all crosslinks have assigned target and decoy labels.
Examples
>>> from pyXLMS.exporter import to_xiview >>> from pyXLMS.parser import read >>> from pyXLMS.transform import targets_only >>> from pyXLMS.transform import filter_proteins >>> pr = read("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_Crosslinks.xlsx", engine="MS Annika", crosslinker="DSS") >>> crosslinks = targets_only(pr)["crosslinks"] >>> cas9 = filter_proteins(crosslinks, proteins=["Cas9"])["Both"] >>> to_xiview(cas9, filename="crosslinks_xiVIEW.csv") AbsPos1 AbsPos2 Protein1 Protein2 Decoy1 Decoy2 Score 0 779 779 Cas9 Cas9 FALSE FALSE 119.83 1 866 866 Cas9 Cas9 FALSE FALSE 114.43 2 677 677 Cas9 Cas9 FALSE FALSE 200.98 3 677 48 Cas9 Cas9 FALSE FALSE 94.47 4 34 34 Cas9 Cas9 FALSE FALSE 110.48 .. ... ... ... ... ... ... ... 248 396 396 Cas9 Cas9 FALSE FALSE 305.63 249 688 952 Cas9 Cas9 FALSE FALSE 110.46 250 793 1180 Cas9 Cas9 FALSE FALSE 288.36 251 575 688 Cas9 Cas9 FALSE FALSE 376.15 252 1180 1180 Cas9 Cas9 FALSE FALSE 437.10 [253 rows x 7 columns]
>>> from pyXLMS.exporter import to_xiview >>> from pyXLMS.parser import read >>> from pyXLMS.transform import targets_only >>> from pyXLMS.transform import filter_proteins >>> pr = read("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_Crosslinks.xlsx", engine="MS Annika", crosslinker="DSS") >>> crosslinks = targets_only(pr)["crosslinks"] >>> cas9 = filter_proteins(crosslinks, proteins=["Cas9"])["Both"] >>> df = to_xiview(cas9, filename=None)
>>> from pyXLMS.exporter import to_xiview >>> from pyXLMS.parser import read >>> from pyXLMS.transform import targets_only >>> from pyXLMS.transform import filter_proteins >>> pr = read("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_Crosslinks.xlsx", engine="MS Annika", crosslinker="DSS") >>> crosslinks = targets_only(pr)["crosslinks"] >>> cas9 = filter_proteins(crosslinks, proteins=["Cas9"])["Both"] >>> to_xiview(cas9, filename="crosslinks_xiVIEW.csv", minimal=False) Protein1 PepPos1 PepSeq1 LinkPos1 Protein2 PepPos2 PepSeq2 LinkPos2 Score Id 0 Cas9 777 GQKNSR 3 Cas9 777 GQKNSR 3 119.83 1 1 Cas9 864 SDKNR 3 Cas9 864 SDKNR 3 114.43 2 2 Cas9 676 DKQSGK 2 Cas9 676 DKQSGK 2 200.98 3 3 Cas9 676 DKQSGK 2 Cas9 45 HSIKK 4 94.47 4 4 Cas9 31 VPSKK 4 Cas9 31 VPSKK 4 110.48 5 .. ... ... ... ... ... ... ... ... ... ... 248 Cas9 387 MDGTEELLVKLNR 10 Cas9 387 MDGTEELLVKLNR 10 305.63 249 249 Cas9 682 TILDFLKSDGFANR 7 Cas9 947 YDENDKLIR 6 110.46 250 250 Cas9 788 IEEGIKELGSQILK 6 Cas9 1176 SSFEKNPIDFLEAK 5 288.36 251 251 Cas9 575 KIECFDSVEISGVEDR 1 Cas9 682 TILDFLKSDGFANR 7 376.15 252 252 Cas9 1176 SSFEKNPIDFLEAK 5 Cas9 1176 SSFEKNPIDFLEAK 5 437.10 253 [253 rows x 10 columns]
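Comparing the two example outputs above suggests how the minimal format relates to the “CSV without peak lists” format: each AbsPos value equals the peptide start position plus the link position within the peptide, minus one. This relation is inferred from the printed rows only and is not stated by the documentation; the function name absolute_position is hypothetical.

```python
def absolute_position(peptide_position: int, link_position: int) -> int:
    """Protein-level crosslink position from two 1-based positions."""
    return peptide_position + link_position - 1


# (PepPos, LinkPos, AbsPos) triples taken from the example outputs above.
rows = [
    (777, 3, 779),
    (864, 3, 866),
    (676, 2, 677),
    (45, 4, 48),
    (31, 4, 34),
    (575, 1, 575),
]

for peptide_position, link_position, expected in rows:
    assert absolute_position(peptide_position, link_position) == expected
```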
pyXLMS.exporter_to_xlinkdb module#
- pyXLMS.exporter_to_xlinkdb.to_xlinkdb(
- crosslinks: List[Dict[str, Any]],
- filename: str | None,
Exports a list of crosslinks to XlinkDB format.
Exports a list of crosslinks to XlinkDB format. The tool XlinkDB is accessible via the link xlinkdb.gs.washington.edu/xlinkdb. Requires that the alpha_proteins and beta_proteins fields are set for all crosslinks.
- Parameters:
crosslinks (list of dict of str, any) – A list of crosslinks.
filename (str, or None) – If not None, the exported data will be written to a file with the specified filename. The filename should not contain a file extension and should consist only of alphanumeric characters (a-z, A-Z, 0-9).
- Returns:
A pandas DataFrame containing crosslinks in XlinkDB format.
- Return type:
pd.DataFrame
- Raises:
TypeError – If a wrong data type is provided.
TypeError – If ‘crosslinks’ parameter contains elements of mixed data type.
ValueError – If the filename contains any non-alpha-numeric characters.
ValueError – If the provided ‘crosslinks’ parameter contains no elements.
RuntimeError – If not all of the required information is present in the input data.
Notes
The XlinkDB input format requires a column with probabilities that the crosslinks are correct. Since that is not available from most crosslink search engines, this is simply set to a constant 1.
Examples
>>> from pyXLMS.exporter import to_xlinkdb >>> from pyXLMS.parser import read >>> pr = read("data/xi/1perc_xl_boost_Links_xiFDR2.2.1.csv", engine="xiSearch/xiFDR", crosslinker="DSS") >>> crosslinks = pr["crosslinks"] >>> to_xlinkdb(crosslinks, filename="crosslinksForXlinkDB") Peptide A Protein A Labeled Position A Peptide B Protein B Labeled Position B Probability 0 VVDELVKVMGR Cas9 6 VVDELVKVMGR Cas9 6 1 1 MLASAGELQKGNELALPSK Cas9 9 VVDELVKVMGR Cas9 6 1 2 MDGTEELLVKLNR Cas9 9 MDGTEELLVKLNR Cas9 9 1 3 MTNFDKNLPNEK Cas9 5 SKLVSDFR Cas9 1 1 4 DFQFYKVR Cas9 5 MIAKSEQEIGK Cas9 3 1 .. ... ... ... ... ... ... ... 222 LPKYSLFELENGR Cas9 2 SDKNR Cas9 2 1 223 DKQSGK Cas9 1 DKQSGK Cas9 1 1 224 AGFIKR Cas9 4 SDNVPSEEVVKK Cas9 10 1 225 EKIEK Cas9 1 KVTVK Cas9 0 1 226 LSKSR Cas9 2 LSKSR Cas9 2 1 [227 rows x 7 columns]
>>> from pyXLMS.exporter import to_xlinkdb >>> from pyXLMS.parser import read >>> pr = read("data/xi/1perc_xl_boost_Links_xiFDR2.2.1.csv", engine="xiSearch/xiFDR", crosslinker="DSS") >>> crosslinks = pr["crosslinks"] >>> df = to_xlinkdb(crosslinks, filename=None)
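Two details of this exporter can be sketched in a few lines: the filename restriction and the labeled positions. In the example output above the labeled positions appear to be 0-based (the lysine in VVDELVKVMGR is the seventh residue but is listed as 6), which is an observation from the printed table rather than documented behaviour. The helper is_valid_xlinkdb_filename is hypothetical and only mirrors the alphanumeric rule described for the filename parameter.

```python
import re


def is_valid_xlinkdb_filename(filename: str) -> bool:
    """True if the name consists only of the characters a-z, A-Z and 0-9."""
    return re.fullmatch(r"[A-Za-z0-9]+", filename) is not None


print(is_valid_xlinkdb_filename("crosslinksForXlinkDB"))  # no extension, letters only
print(is_valid_xlinkdb_filename("crosslinks.csv"))        # contains a dot

# Python string indices are 0-based as well, so they line up with the
# "Labeled Position" values shown in the example table.
print("VVDELVKVMGR".index("K"))
```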
pyXLMS.exporter_to_xlmstools module#
- pyXLMS.exporter_to_xlmstools.to_xlmstools(
- crosslinks: List[Dict[str, Any]],
- pdb_file: str | BinaryIO,
- gap_open: int | float = -10.0,
- gap_extension: int | float = -1.0,
- min_sequence_identity: float = 0.8,
- allow_site_mismatch: bool = False,
- ignore_chains: List[str] = [],
- filename_prefix: str | None = None,
Exports a list of crosslinks to xlms-tools format.
Exports a list of crosslinks to xlms-tools format for protein structure analysis. The Python package xlms-tools is available from gitlab.com/topf-lab/xlms-tools. This exporter performs a basic local sequence alignment to align crosslinked peptides to a protein structure in PDB format. Gap open and gap extension penalties can be chosen, as well as a sequence identity threshold that must be satisfied for a match to be reported. Additionally, the alignment is checked to determine whether the supposedly crosslinked residue can be modified with a crosslinker in the protein structure: because of alignment shifts the aligned amino acid may differ, so a crosslink could be reported at a position that is not able to react with the crosslinker. By default such positions are not reported; optionally, they can still be reported.
- Parameters:
crosslinks (list of dict of str, any) – A list of crosslinks.
pdb_file (str, or file stream) – The name/path of the PDB file or a file-like object/stream. If a string is provided but no file is found locally, it’s assumed to be an identifier and the file is fetched from the PDB.
gap_open (int, or float, default = -10.0) – Gap open penalty for sequence alignment.
gap_extension (int, or float, default = -1.0) – Gap extension penalty for sequence alignment.
min_sequence_identity (float, default = 0.8) – Minimum sequence identity to consider an aligned crosslinked peptide a match with its corresponding position in the protein structure. Should be given as a fraction between 0 and 1, e.g. the default of 0.8 corresponds to a minimum of 80% sequence identity.
allow_site_mismatch (bool, default = False) – Whether the position should still be reported if the crosslink position after alignment is not a reactive amino acid in the protein structure. By default such cases are not reported.
ignore_chains (list of str, default = empty list) – A list of chains to ignore in the protein structure.
filename_prefix (str, or None, default = None) – If not None, the exported data will be written to files with the specified filename prefix. The full list of written files can be accessed via the returned dictionary.
- Returns:
Returns a dictionary with the following keys:
xlms-tools: The formatted text for xlms-tools.
xlms-tools DataFrame: The information from xlms-tools, but as a pandas DataFrame.
Number of mapped crosslinks: The total number of mapped crosslinks.
Mapping: A string that logs how crosslinks were mapped to the protein structure.
Parsed PDB sequence: The protein sequence that was parsed from the PDB file.
Parsed PDB chains: The parsed chains from the PDB file.
Parsed PDB residue numbers: The parsed residue numbers from the PDB file.
Exported files: A list of filenames of all files that were written to disk.
- Return type:
dict of str, any
- Raises:
TypeError – If a wrong data type is provided.
TypeError – If data contains elements of mixed data type.
ValueError – If parameter min_sequence_identity is out of bounds.
ValueError – If the provided data contains no elements.
Notes
Internally this exporter just calls exporter.to_pyxlinkviewer() and re-writes some of the files since the two tools share the same input file structure.
Examples
>>> from pyXLMS.exporter import to_xlmstools >>> from pyXLMS.parser import read_custom >>> pr = read_custom("data/_test/exporter/xlms-tools/unique_links_all_pyxlms.csv") >>> crosslinks = pr["crosslinks"] >>> xlmstools_result = to_xlmstools(crosslinks, pdb_file="6YHU", filename_prefix="6YHU") >>> xlmstools_output_file_str = xlmstools_result["xlms-tools"] >>> xlmstools_dataframe = xlmstools_result["xlms-tools DataFrame"] >>> nr_mapped_crosslinks = xlmstools_result["Number of mapped crosslinks"] >>> crosslink_mapping = xlmstools_result["Mapping"] >>> parsed_pdb_sequence = xlmstools_result["Parsed PDB sequence"] >>> parsed_pdb_chains = xlmstools_result["Parsed PDB chains"] >>> parsed_pdb_residue_numbers = xlmstools_result["Parsed PDB residue numbers"] >>> exported_files = xlmstools_result["Exported files"]
pyXLMS.exporter_to_xmas module#
- pyXLMS.exporter_to_xmas.to_xmas(
- crosslinks: List[Dict[str, Any]],
- filename: str | None,
Exports a list of crosslinks to XMAS format.
Exports a list of crosslinks to XMAS format for visualization in ChimeraX. The tool XMAS is available from github.com/ScheltemaLab/ChimeraX_XMAS_bundle.
- Parameters:
crosslinks (list of dict of str, any) – A list of crosslinks.
filename (str, or None) – If not None, the exported data will be written to a file with the specified filename.
- Returns:
A pandas DataFrame containing crosslinks in XMAS format.
- Return type:
pd.DataFrame
- Raises:
TypeError – If a wrong data type is provided.
TypeError – If ‘crosslinks’ parameter contains elements of mixed data type.
ValueError – If the provided ‘crosslinks’ parameter contains no elements.
Examples
>>> from pyXLMS.exporter import to_xmas >>> from pyXLMS.data import create_crosslink_min >>> xl1 = create_crosslink_min("KPEPTIDE", 1, "PKEPTIDE", 2) >>> xl2 = create_crosslink_min("PEKPTIDE", 3, "PEPKTIDE", 4) >>> crosslinks = [xl1, xl2] >>> to_xmas(crosslinks, filename="crosslinks_xmas.xlsx") Sequence A Sequence B 0 [K]PEPTIDE P[K]EPTIDE 1 PE[K]PTIDE PEP[K]TIDE
>>> from pyXLMS.exporter import to_xmas >>> from pyXLMS.data import create_crosslink_min >>> xl1 = create_crosslink_min("KPEPTIDE", 1, "PKEPTIDE", 2) >>> xl2 = create_crosslink_min("PEKPTIDE", 3, "PEPKTIDE", 4) >>> crosslinks = [xl1, xl2] >>> to_xmas(crosslinks, filename=None) Sequence A Sequence B 0 [K]PEPTIDE P[K]EPTIDE 1 PE[K]PTIDE PEP[K]TIDE
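The bracket notation in the XMAS output above can be reproduced with a short helper. This is a sketch of the formatting only, written to match the example rows; mark_crosslink_site is a hypothetical name and not a pyXLMS function.

```python
def mark_crosslink_site(peptide: str, position: int) -> str:
    """Wrap the residue at the given 1-based position in square brackets."""
    if not 1 <= position <= len(peptide):
        raise ValueError("position must lie within the peptide sequence.")
    index = position - 1  # convert to a 0-based string index
    return f"{peptide[:index]}[{peptide[index]}]{peptide[index + 1:]}"


print(mark_crosslink_site("KPEPTIDE", 1))  # [K]PEPTIDE
print(mark_crosslink_site("PEPKTIDE", 4))  # PEP[K]TIDE
```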
pyXLMS.exporter_util module#
pyXLMS.parser module#
- pyXLMS.parser.read(
- files: str | List[str] | BinaryIO,
- engine: Literal['Custom', 'MaxQuant', 'MaxLynx', 'MS Annika', 'mzIdentML', 'pLink', 'Scout', 'xiSearch/xiFDR', 'XlinkX'],
- crosslinker: str,
- parse_modifications: bool = True,
- ignore_errors: bool = False,
- verbose: Literal[0, 1, 2] = 1,
- **kwargs,
Read a crosslink result file.
Reads a crosslink or crosslink-spectrum-match result file from any of the supported crosslink search engines or formats. Currently supports result files from MaxLynx/MaxQuant, MS Annika, pLink 2 and pLink 3, Scout, xiSearch and xiFDR, XlinkX, and the mzIdentML format. Additionally supports parsing from custom .csv files in pyXLMS format; for more about the custom format see parser.read_custom() and the docs.
- Parameters:
files (str, list of str, or file stream) – The name/path of the result file(s) or a file-like object/stream.
engine ("Custom", "MaxQuant", "MaxLynx", "MS Annika", "mzIdentML", "pLink", "Scout", "xiSearch/xiFDR", or "XlinkX") – Crosslink search engine or format of the result file.
crosslinker (str) – Name of the used cross-linking reagent, for example “DSSO”.
parse_modifications (bool, default = True) – Whether or not post-translational-modifications should be parsed for crosslink-spectrum-matches. Requires correct specification of the ‘modifications’ parameter for every parser. Defaults are selected for every parser if ‘modifications’ is not passed via **kwargs.
ignore_errors (bool, default = False) – Ignore errors when mapping modifications. Used in parser.read_xi() and parser.read_xlinkx().
verbose (0, 1, or 2, default = 1) –
0: All warnings are ignored.
1: Warnings are printed to stdout.
2: Warnings are treated as errors.
**kwargs – Any additional parameters will be passed to the specific parsers.
- Returns:
The parser_result object containing all parsed information.
- Return type:
dict
- Raises:
ValueError – If the value entered for parameter engine is not supported.
Examples
>>> from pyXLMS.parser import read >>> csms_from_xiSearch = read("data/xi/r1_Xi1.7.6.7.csv", engine="xiSearch/xiFDR", crosslinker="DSS")
>>> from pyXLMS.parser import read >>> csms_from_MaxQuant = read("data/maxquant/run1/crosslinkMsms.txt", engine="MaxQuant", crosslinker="DSS")
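The engine check and the three verbose levels described above can be sketched as follows. This is an illustration under assumptions, not the pyXLMS implementation: the engine names are copied from the signature, while the helpers validate_engine and handle_warning, and the choice of RuntimeError for verbose = 2, are hypothetical.

```python
# Engine names as listed in the signature of parser.read().
SUPPORTED_ENGINES = (
    "Custom", "MaxQuant", "MaxLynx", "MS Annika", "mzIdentML",
    "pLink", "Scout", "xiSearch/xiFDR", "XlinkX",
)


def validate_engine(engine: str) -> str:
    """Return the engine name unchanged or raise ValueError if unsupported."""
    if engine not in SUPPORTED_ENGINES:
        raise ValueError(f"Engine {engine!r} is not supported.")
    return engine


def handle_warning(message: str, verbose: int = 1) -> None:
    """Apply the documented verbose levels to a single warning message."""
    if verbose == 0:
        return  # warnings are ignored
    if verbose == 1:
        print(f"Warning: {message}")  # warnings are printed to stdout
        return
    if verbose == 2:
        # Assumption: the concrete exception type is not documented.
        raise RuntimeError(message)
    raise ValueError("verbose must be 0, 1, or 2.")


validate_engine("MS Annika")
handle_warning("Could not map a modification.", verbose=1)
```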
pyXLMS.parser_util module#
- pyXLMS.parser_util.format_sequence(
- sequence: str,
- remove_non_aa: bool = True,
- remove_lower: bool = True,
Formats the given amino acid sequence into a common representation.
The given amino acid sequence is re-formatted by converting all amino acids to upper case and optionally removing non-encoding and lower case characters.
- Parameters:
sequence (str) – The amino acid sequence that should be formatted. Post-translational-modifications can be included in lower case but will be removed.
remove_non_aa (bool, default = True) – Whether or not to remove characters that do not encode amino acids.
remove_lower (bool, default = True) – Whether or not to remove lower case characters, this should be true if the amino acid sequence encodes post-translational-modifications in lower case.
- Returns:
The formatted sequence.
- Return type:
str
Examples
>>> from pyXLMS.parser_util import format_sequence >>> format_sequence("PEP[K]TIDE") 'PEPKTIDE'
>>> from pyXLMS.parser_util import format_sequence >>> format_sequence("PEPKdssoTIDE") 'PEPKTIDE'
>>> from pyXLMS.parser_util import format_sequence >>> format_sequence("peptide", remove_lower = False) 'PEPTIDE'
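The behaviour shown in the three examples above can be approximated with a short re-implementation. This is a sketch that assumes lower case characters are removed before upper-casing and that valid residues are those in constants.AMINO_ACIDS. The name format_sequence_sketch is hypothetical and the actual order of operations in pyXLMS may differ.

```python
# One-letter codes as listed in pyXLMS.constants.AMINO_ACIDS.
AMINO_ACIDS = set("ACDEFGHIJKLMNOPQRSTUVWY")


def format_sequence_sketch(
    sequence: str, remove_non_aa: bool = True, remove_lower: bool = True
) -> str:
    """Upper-case a sequence, optionally dropping lower case and non-residue characters."""
    if remove_lower:
        # Drop lower case characters first, e.g. modifications such as "dsso".
        sequence = "".join(char for char in sequence if not char.islower())
    sequence = sequence.upper()
    if remove_non_aa:
        # Drop anything that is not a valid one-letter code, e.g. brackets.
        sequence = "".join(char for char in sequence if char in AMINO_ACIDS)
    return sequence


print(format_sequence_sketch("PEP[K]TIDE"))
print(format_sequence_sketch("PEPKdssoTIDE"))
print(format_sequence_sketch("peptide", remove_lower=False))
```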
- pyXLMS.parser_util.get_bool_from_value(value: Any) → bool#
Parse a bool value from the given input.
Tries to parse a boolean value from the given input object. If the object is an instance of bool, the object itself is returned. If it is an instance of int, True is returned if the object is 1 and False if the object is 0; any other number raises a ValueError. If the object is an instance of str, True is returned if the lower case version contains the letter t, and False otherwise. If the object is none of these types, a ValueError is raised.
- Parameters:
value (Any) – The value to parse from.
- Returns:
The parsed boolean value.
- Return type:
bool
- Raises:
ValueError – If the object could not be parsed to bool.
Examples
>>> from pyXLMS.parser_util import get_bool_from_value >>> get_bool_from_value(0) False
>>> from pyXLMS.parser_util import get_bool_from_value >>> get_bool_from_value("T") True
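The parsing rules described above translate almost directly into code. The following sketch follows the description rule by rule; it is not the pyXLMS source, and get_bool_sketch is a hypothetical name.

```python
from typing import Any


def get_bool_sketch(value: Any) -> bool:
    """Parse a bool from a bool, an int (0 or 1) or a str (contains the letter 't')."""
    # bool must be checked before int because bool is a subclass of int.
    if isinstance(value, bool):
        return value
    if isinstance(value, int):
        if value in (0, 1):
            return value == 1
        raise ValueError(f"Cannot parse bool from integer {value}.")
    if isinstance(value, str):
        return "t" in value.lower()
    raise ValueError(f"Cannot parse bool from type {type(value).__name__}.")


print(get_bool_sketch(0))        # False
print(get_bool_sketch("T"))      # True
print(get_bool_sketch("False"))  # False, the string contains no "t"
```

One consequence of the string rule as described is that any string containing the letter t is parsed as True, so inputs such as "not" would also yield True.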
pyXLMS.parser_xldbse_custom module#
- pyXLMS.parser_xldbse_custom.pyxlms_modification_str_parser(
- modifications: str,
Parse a pyXLMS modification string.
Parses a pyXLMS modification string and returns the pyXLMS specific modification object, a dictionary that maps positions to their modifications.
- Parameters:
modifications (str) – The pyXLMS modification string.
- Returns:
The pyXLMS specific modification object, a dictionary that maps positions (1-based) to their respective modifications given as tuples of modification name and modification delta mass.
- Return type:
dict of int, tuple
- Raises:
RuntimeError – If multiple modifications on the same residue are parsed.
Examples
>>> from pyXLMS.parser import pyxlms_modification_str_parser >>> modification_str = "(1:[DSS|138.06808])" >>> pyxlms_modification_str_parser(modification_str) {1: ('DSS', 138.06808)}
>>> from pyXLMS.parser import pyxlms_modification_str_parser >>> modification_str = "(1:[DSS|138.06808]);(7:[Oxidation|15.994915])" >>> pyxlms_modification_str_parser(modification_str) {1: ('DSS', 138.06808), 7: ('Oxidation', 15.994915)}
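For readers who need to write a custom modification_parser, the following sketch shows one way to parse strings of the shape used in the examples above. The grammar is inferred from those two examples only (entries of the form (position:[name|mass]) separated by semicolons), so it is an assumption rather than the full pyXLMS specification; parse_modification_str_sketch is a hypothetical name.

```python
import re
from typing import Dict, Tuple

# One entry looks like "(7:[Oxidation|15.994915])".
_ENTRY = re.compile(r"\((\d+):\[([^|\]]+)\|([^\]]+)\]\)")


def parse_modification_str_sketch(modifications: str) -> Dict[int, Tuple[str, float]]:
    """Map 1-based positions to (modification name, delta mass) tuples."""
    parsed: Dict[int, Tuple[str, float]] = {}
    for match in _ENTRY.finditer(modifications):
        position = int(match.group(1))
        if position in parsed:
            # Mirrors the documented RuntimeError for repeated positions.
            raise RuntimeError(f"Multiple modifications at position {position}.")
        parsed[position] = (match.group(2), float(match.group(3)))
    return parsed


print(parse_modification_str_sketch("(1:[DSS|138.06808]);(7:[Oxidation|15.994915])"))
```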
- pyXLMS.parser_xldbse_custom.read_custom(
- files: str | List[str] | BinaryIO,
- column_mapping: Dict[str, str] | None = None,
- parse_modifications: bool = True,
- modification_parser: Callable[[str], Dict[int, Tuple[str, float]]] | None = None,
- decoy_prefix: str = 'REV_',
- format: Literal['auto', 'csv', 'txt', 'tsv', 'xlsx'] = 'auto',
- sep: str = ',',
- decimal: str = '.',
Read a custom or pyXLMS result file.
Reads a custom or pyXLMS crosslink-spectrum-matches result file or crosslink result file in .csv or .xlsx format, and returns a parser_result.
The minimum required columns for a crosslink-spectrum-matches result file are:
“Alpha Peptide”: The unmodified amino acid sequence of the first peptide.
“Alpha Peptide Crosslink Position”: The position of the crosslinker in the sequence of the first peptide (1-based).
“Beta Peptide”: The unmodified amino acid sequence of the second peptide.
“Beta Peptide Crosslink Position”: The position of the crosslinker in the sequence of the second peptide (1-based).
“Spectrum File”: Name of the spectrum file the crosslink-spectrum-match was identified in.
“Scan Nr”: The corresponding scan number of the crosslink-spectrum-match.
The minimum required columns for a crosslink result file are:
“Alpha Peptide”: The unmodified amino acid sequence of the first peptide.
“Alpha Peptide Crosslink Position”: The position of the crosslinker in the sequence of the first peptide (1-based).
“Beta Peptide”: The unmodified amino acid sequence of the second peptide.
“Beta Peptide Crosslink Position”: The position of the crosslinker in the sequence of the second peptide (1-based).
A full specification of columns that can be parsed can be found in the docs.
- Parameters:
files (str, list of str, or file stream) – The name/path of the result file(s) or a file-like object/stream.
column_mapping (dict of str, str) – A dictionary that maps the result file columns to the required pyXLMS column names.
parse_modifications (bool, default = True) – Whether or not post-translational-modifications should be parsed for crosslink-spectrum-matches. Requires correct specification of the ‘modification_parser’ parameter.
modification_parser (callable, or None) – A function that parses modification strings and returns the pyXLMS specific modifications object. If None, the function pyxlms_modification_str_parser() is used. If no modification columns are given this parameter is ignored.
decoy_prefix (str, default = "REV_") – The prefix that indicates that a protein is from the decoy database.
format ("auto", "csv", "tsv", "txt", or "xlsx", default = "auto") – The format of the result file. "auto" is only available if the name/path to the result file is given.
sep (str, default = ",") – Separator used in the .csv or .tsv file. Parameter is ignored if the file is in .xlsx format.
decimal (str, default = ".") – Character to recognize as decimal point. Parameter is ignored if the file is in .xlsx format.
- Returns:
The parser_result object containing all parsed information.
- Return type:
dict
- Raises:
ValueError – If the input format is not supported or cannot be inferred.
TypeError – If one of the values could not be parsed.
RuntimeError – If the file(s) could not be read or if the file(s) contain no crosslinks or crosslink-spectrum-matches.
Examples
>>> from pyXLMS.parser import read_custom >>> csms_from_pyxlms = read_custom("data/pyxlms/csm.txt")
>>> from pyXLMS.parser import read_custom >>> crosslinks_from_pyxlms = read_custom("data/pyxlms/xl.txt")
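Before passing a hand-made table to read_custom it can be useful to confirm that the minimum required columns are present. The following sketch writes a one-row crosslink-spectrum-match table with the standard library and checks its header; the column names are taken from the lists above, the row values are illustrative, and missing_columns is a hypothetical helper.

```python
import csv
import io
from typing import List

CROSSLINK_REQUIRED_COLUMNS = [
    "Alpha Peptide",
    "Alpha Peptide Crosslink Position",
    "Beta Peptide",
    "Beta Peptide Crosslink Position",
]
CSM_REQUIRED_COLUMNS = CROSSLINK_REQUIRED_COLUMNS + ["Spectrum File", "Scan Nr"]


def missing_columns(header: List[str], required: List[str]) -> List[str]:
    """Return the required column names that are absent from the header."""
    return [column for column in required if column not in header]


# Write a minimal crosslink-spectrum-match table to an in-memory CSV file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=CSM_REQUIRED_COLUMNS)
writer.writeheader()
writer.writerow({
    "Alpha Peptide": "KPEPTIDE",
    "Alpha Peptide Crosslink Position": 1,
    "Beta Peptide": "PKEPTIDE",
    "Beta Peptide Crosslink Position": 2,
    "Spectrum File": "RUN_1",
    "Scan Nr": 1,
})

header = next(csv.reader(io.StringIO(buffer.getvalue())))
print(missing_columns(header, CSM_REQUIRED_COLUMNS))
```

If the source file uses different column names, the column_mapping parameter described above is the documented way to map them to the pyXLMS names.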
pyXLMS.parser_xldbse_maxquant module#
- pyXLMS.parser_xldbse_maxquant.parse_modifications_from_maxquant_sequence(
- seq: str,
- crosslink_position: int,
- crosslinker: str,
- crosslinker_mass: float,
- modifications: Dict[str, float] = {'ADH': 138.09054635, 'Acetyl': 42.010565, 'BS3': 138.06808, 'Carbamidomethyl': 57.021464, 'DSBSO': 308.03883, 'DSBU': 196.08479231, 'DSS': 138.06808, 'DSSO': 158.00376, 'Oxidation': 15.994915, 'PhoX': 209.97181, 'Phospho': 79.966331},
Parse post-translational-modifications from a MaxQuant peptide sequence.
Parses post-translational-modifications (PTMs) from a MaxQuant peptide sequence, for example “_VVDELVKVM(Oxidation (M))GR_”.
- Parameters:
seq (str) – The MaxQuant sequence string.
crosslink_position (int) – Position of the crosslinker in the sequence (1-based).
crosslinker (str) – Name of the used cross-linking reagent, for example “DSSO”.
crosslinker_mass (float) – Monoisotopic delta mass of the crosslink modification.
modifications (dict of str, float, default = constants.MODIFICATIONS) – Mapping of modification names to modification masses.
- Returns:
The pyXLMS specific modifications object, a dictionary that maps positions to their corresponding modifications and their monoisotopic masses.
- Return type:
dict of int, tuple
- Raises:
RuntimeError – If the sequence could not be parsed because it is not in MaxQuant format.
RuntimeError – If multiple modifications on the same residue are parsed.
KeyError – If an unknown modification is encountered.
Examples
>>> from pyXLMS.parser import parse_modifications_from_maxquant_sequence
>>> seq = "_VVDELVKVM(Oxidation (M))GR_"
>>> parse_modifications_from_maxquant_sequence(seq, 2, "DSS", 138.06808)
{2: ('DSS', 138.06808), 9: ('Oxidation', 15.994915)}
>>> from pyXLMS.parser import parse_modifications_from_maxquant_sequence
>>> seq = "_VVDELVKVM(Oxidation (M))GRM(Oxidation (M))_"
>>> parse_modifications_from_maxquant_sequence(seq, 2, "DSS", 138.06808)
{2: ('DSS', 138.06808), 9: ('Oxidation', 15.994915), 12: ('Oxidation', 15.994915)}
>>> from pyXLMS.parser import parse_modifications_from_maxquant_sequence
>>> seq = "_M(Oxidation (M))VVDELVKVM(Oxidation (M))GRM(Oxidation (M))_"
>>> parse_modifications_from_maxquant_sequence(seq, 2, "DSS", 138.06808)
{2: ('DSS', 138.06808), 1: ('Oxidation', 15.994915), 10: ('Oxidation', 15.994915), 13: ('Oxidation', 15.994915)}
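The position bookkeeping behind these examples can be illustrated with a small standalone sketch. It is not the pyXLMS implementation: the helper name `sketch_parse_maxquant_modifications` and the reduced mass table are assumptions made for illustration, and the treatment of a modification that collides with the crosslink position is likewise assumed.

```python
from typing import Dict, Tuple

# Two entries copied from the constants.MODIFICATIONS default shown in the signature.
EXAMPLE_MASSES: Dict[str, float] = {"Oxidation": 15.994915, "Carbamidomethyl": 57.021464}


def sketch_parse_maxquant_modifications(
    seq: str,
    crosslink_position: int,
    crosslinker: str,
    crosslinker_mass: float,
    masses: Dict[str, float] = EXAMPLE_MASSES,
) -> Dict[int, Tuple[str, float]]:
    """Hypothetical helper for illustration; not the pyXLMS implementation."""
    if len(seq) < 3 or not (seq.startswith("_") and seq.endswith("_")):
        raise RuntimeError(f"'{seq}' is not in MaxQuant format.")
    parsed = {crosslink_position: (crosslinker, crosslinker_mass)}
    position = 0  # 1-based index of the most recently read residue
    i, end = 1, len(seq) - 1  # skip the flanking underscores
    while i < end:
        if seq[i] != "(":
            position += 1
            i += 1
            continue
        # Annotations nest, e.g. "(Oxidation (M))", so track the parenthesis depth.
        depth, j = 1, i + 1
        while j < end and depth > 0:
            depth += {"(": 1, ")": -1}.get(seq[j], 0)
            j += 1
        if depth != 0:
            raise RuntimeError(f"Unbalanced parentheses in '{seq}'.")
        name = seq[i + 1 : j - 1].split(" (")[0]  # "Oxidation (M)" -> "Oxidation"
        if position in parsed:
            raise RuntimeError(f"Multiple modifications at position {position}.")
        parsed[position] = (name, masses[name])  # KeyError for unknown names
        i = j
    return parsed
```

Under these assumptions the sketch returns the same dictionaries as the three documented examples, and an unknown modification name surfaces as a `KeyError` from the mass lookup.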
- pyXLMS.parser_xldbse_maxquant.read_maxlynx(
- files: str | List[str] | BinaryIO,
- crosslinker: str,
- crosslinker_mass: float | None = None,
- decoy_prefix: str = 'REV__',
- parse_modifications: bool = True,
- modifications: Dict[str, float] = {'ADH': 138.09054635, 'Acetyl': 42.010565, 'BS3': 138.06808, 'Carbamidomethyl': 57.021464, 'DSBSO': 308.03883, 'DSBU': 196.08479231, 'DSS': 138.06808, 'DSSO': 158.00376, 'Oxidation': 15.994915, 'PhoX': 209.97181, 'Phospho': 79.966331},
- sep: str = '\t',
- decimal: str = '.',
)
Read a MaxLynx result file.
Reads a MaxLynx crosslink-spectrum-matches result file “crosslinkMsms.txt” in .txt (tab delimited) format and returns a parser_result. This is an alias for the MaxQuant reader.
- Parameters:
files (str, list of str, or file stream) – The name/path of the MaxLynx result file(s) or a file-like object/stream.
crosslinker (str) – Name of the used cross-linking reagent, for example “DSSO”.
crosslinker_mass (float, or None, default = None) – Monoisotopic delta mass of the crosslink modification. If the crosslinker is defined in parameter “modifications” this can be omitted.
decoy_prefix (str, default = "REV__") – The prefix that indicates that a protein is from the decoy database.
parse_modifications (bool, default = True) – Whether or not post-translational-modifications should be parsed for crosslink-spectrum-matches. Requires correct specification of the ‘modifications’ parameter.
modifications (dict of str, float, default = constants.MODIFICATIONS) – Mapping of modification names to modification masses.
sep (str, default = "\t") – Separator used in the .txt file.
decimal (str, default = ".") – Character to recognize as decimal point.
- Returns:
The parser_result object containing all parsed information.
- Return type:
dict
- Raises:
RuntimeError – If the file(s) could not be read or if the file(s) contain no crosslink-spectrum-matches.
KeyError – If the specified crosslinker could not be found/mapped.
Warning
MaxLynx/MaxQuant only reports a single protein crosslink position per peptide; for ambiguous peptides, only the crosslink position of the first matching protein is reported. All matching proteins can be retrieved via additional_information, but not their corresponding crosslink positions. For this reason it is recommended to use transform.reannotate_positions() to correctly annotate all crosslink positions for all peptides if that is important for downstream analysis.
Examples
>>> from pyXLMS.parser import read_maxlynx
>>> csms = read_maxlynx("data/maxquant/run1/crosslinkMsms.txt")
- pyXLMS.parser_xldbse_maxquant.read_maxquant(
- files: str | List[str] | BinaryIO,
- crosslinker: str,
- crosslinker_mass: float | None = None,
- decoy_prefix: str = 'REV__',
- parse_modifications: bool = True,
- modifications: Dict[str, float] = {'ADH': 138.09054635, 'Acetyl': 42.010565, 'BS3': 138.06808, 'Carbamidomethyl': 57.021464, 'DSBSO': 308.03883, 'DSBU': 196.08479231, 'DSS': 138.06808, 'DSSO': 158.00376, 'Oxidation': 15.994915, 'PhoX': 209.97181, 'Phospho': 79.966331},
- sep: str = '\t',
- decimal: str = '.',
)
Read a MaxQuant result file.
Reads a MaxQuant crosslink-spectrum-matches result file “crosslinkMsms.txt” in .txt (tab delimited) format and returns a parser_result.
- Parameters:
files (str, list of str, or file stream) – The name/path of the MaxQuant result file(s) or a file-like object/stream.
crosslinker (str) – Name of the used cross-linking reagent, for example “DSSO”.
crosslinker_mass (float, or None, default = None) – Monoisotopic delta mass of the crosslink modification. If the crosslinker is defined in parameter “modifications” this can be omitted.
decoy_prefix (str, default = "REV__") – The prefix that indicates that a protein is from the decoy database.
parse_modifications (bool, default = True) – Whether or not post-translational-modifications should be parsed for crosslink-spectrum-matches. Requires correct specification of the ‘modifications’ parameter.
modifications (dict of str, float, default = constants.MODIFICATIONS) – Mapping of modification names to modification masses.
sep (str, default = "\t") – Separator used in the .txt file.
decimal (str, default = ".") – Character to recognize as decimal point.
- Returns:
The parser_result object containing all parsed information.
- Return type:
dict
- Raises:
RuntimeError – If the file(s) could not be read or if the file(s) contain no crosslink-spectrum-matches.
KeyError – If the specified crosslinker could not be found/mapped.
Warning
MaxLynx/MaxQuant only reports a single protein crosslink position per peptide; for ambiguous peptides, only the crosslink position of the first matching protein is reported. All matching proteins can be retrieved via additional_information, but not their corresponding crosslink positions. For this reason it is recommended to use transform.reannotate_positions() to correctly annotate all crosslink positions for all peptides if that is important for downstream analysis.
Examples
>>> from pyXLMS.parser import read_maxquant
>>> csms = read_maxquant("data/maxquant/run1/crosslinkMsms.txt")
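The interplay between crosslinker, crosslinker_mass, and modifications described above can be pictured as a simple lookup with an explicit override. The following is a minimal sketch under that assumption; `sketch_resolve_crosslinker_mass` is a hypothetical helper and not part of pyXLMS.

```python
from typing import Dict, Optional

# Subset of the constants.MODIFICATIONS default shown in the signature above.
EXAMPLE_MODIFICATIONS: Dict[str, float] = {"DSS": 138.06808, "DSSO": 158.00376}


def sketch_resolve_crosslinker_mass(
    crosslinker: str,
    crosslinker_mass: Optional[float] = None,
    modifications: Dict[str, float] = EXAMPLE_MODIFICATIONS,
) -> float:
    """Hypothetical helper: an explicit mass wins, otherwise the name is looked up."""
    if crosslinker_mass is not None:
        return crosslinker_mass
    if crosslinker not in modifications:
        # Mirrors the documented KeyError for crosslinkers that cannot be mapped.
        raise KeyError(f"Cannot infer the mass of crosslinker '{crosslinker}'.")
    return modifications[crosslinker]
```

With this logic, a crosslinker that is missing from the mapping only works if its mass is passed explicitly.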
pyXLMS.parser_xldbse_msannika module#
- pyXLMS.parser_xldbse_msannika.read_msannika(
- files: str | List[str] | BinaryIO,
- parse_modifications: bool = True,
- modifications: Dict[str, float] = {'ADH': 138.09054635, 'Acetyl': 42.010565, 'BS3': 138.06808, 'Carbamidomethyl': 57.021464, 'DSBSO': 308.03883, 'DSBU': 196.08479231, 'DSS': 138.06808, 'DSSO': 158.00376, 'Oxidation': 15.994915, 'PhoX': 209.97181, 'Phospho': 79.966331},
- format: Literal['auto', 'csv', 'txt', 'tsv', 'xlsx', 'pdresult'] = 'auto',
- sep: str = '\t',
- decimal: str = '.',
- unsafe: bool = False,
- verbose: Literal[0, 1, 2] = 1,
)
Read an MS Annika result file.
Reads an MS Annika crosslink-spectrum-matches result file or crosslink result file in .csv or .xlsx format, or both from a .pdResult file from Proteome Discoverer, and returns a parser_result.
- Parameters:
files (str, list of str, or file stream) – The name/path of the MS Annika result file(s) or a file-like object/stream.
parse_modifications (bool, default = True) – Whether or not post-translational-modifications should be parsed for crosslink-spectrum-matches. Requires correct specification of the ‘modifications’ parameter.
modifications (dict of str, float, default = constants.MODIFICATIONS) – Mapping of modification names to modification masses.
format ("auto", "csv", "tsv", "txt", "xlsx", or "pdresult", default = "auto") – The format of the result file. "auto" is only available if the name/path to the MS Annika result file is given.
sep (str, default = "\t") – Separator used in the .csv or .tsv file. Parameter is ignored if the file is in .xlsx or .pdResult format.
decimal (str, default = ".") – Character to recognize as decimal point. Parameter is ignored if the file is in .xlsx or .pdResult format.
unsafe (bool, default = False) – If True, allows reading of negative peptide and crosslink positions but replaces their values with None. Negative values occur when peptides can’t be matched to proteins because of ‘X’ in protein sequences. Reannotation might be possible with transform.reannotate_positions().
verbose (0, 1, or 2, default = 1) –
0: All warnings are ignored.
1: Warnings are printed to stdout.
2: Warnings are treated as errors.
- Returns:
The parser_result object containing all parsed information.
- Return type:
dict
- Raises:
ValueError – If the input format is not supported or cannot be inferred.
TypeError – If the pdResult file is provided in the wrong format.
TypeError – If parameter verbose was not set correctly.
RuntimeError – If one of the crosslinks or crosslink-spectrum-matches contains unknown crosslink or peptide positions. This occurs when peptides can’t be matched to proteins because of ‘X’ in protein sequences. Selecting ‘unsafe = True’ will ignore these errors and return None type positions. Reannotation might be possible with transform.reannotate_positions().
RuntimeError – If the file(s) could not be read or if the file(s) contain no crosslinks or crosslink-spectrum-matches.
KeyError – If one of the found post-translational-modifications could not be found/mapped.
Warning
MS Annika does not report if the individual peptides in a crosslink are from the target or decoy database. The parser assumes that both peptides from a target crosslink are from the target database, and vice versa, that both peptides are from the decoy database if it is a decoy crosslink. This leads to only TT and DD matches, which needs to be considered for FDR estimation. This also only applies to crosslinks and not crosslink-spectrum-matches, where this information is correctly reported and parsed.
Examples
>>> from pyXLMS.parser import read_msannika
>>> csms_from_xlsx = read_msannika("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_CSMs.xlsx")
>>> from pyXLMS.parser import read_msannika
>>> crosslinks_from_xlsx = read_msannika("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_Crosslinks.xlsx")
>>> from pyXLMS.parser import read_msannika
>>> csms_from_tsv = read_msannika("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_CSMs.txt")
>>> from pyXLMS.parser import read_msannika
>>> crosslinks_from_tsv = read_msannika("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_Crosslinks.txt")
>>> from pyXLMS.parser import read_msannika
>>> csms_and_crosslinks_from_pdresult = read_msannika("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1.pdResult")
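The restriction that format "auto" is only available for file names/paths suggests that the format is inferred from the file extension. A minimal sketch of such an inference, assuming exactly that (the helper `sketch_infer_format` is hypothetical and not taken from pyXLMS):

```python
import os

SUPPORTED_FORMATS = ("csv", "txt", "tsv", "xlsx", "pdresult")


def sketch_infer_format(file, format: str = "auto") -> str:
    """Hypothetical helper: resolve "auto" to a concrete format via the extension."""
    if format != "auto":
        if format not in SUPPORTED_FORMATS:
            raise ValueError(f"Unsupported format: {format}")
        return format
    if not isinstance(file, str):
        # A file-like object carries no reliable name, so nothing can be inferred.
        raise ValueError('format = "auto" requires a file name/path.')
    extension = os.path.splitext(file)[1].lstrip(".").lower()
    if extension not in SUPPORTED_FORMATS:
        raise ValueError(f"Cannot infer format from extension '{extension}'.")
    return extension
```

When a stream is passed instead of a path, the format therefore has to be stated explicitly, for example `format="xlsx"`.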
pyXLMS.parser_xldbse_mzid module#
- pyXLMS.parser_xldbse_mzid.parse_scan_nr_from_mzid(spectrum_id: str) → int [source]#
Parse the scan number from a ‘spectrumID’ of an mzIdentML file.
- Parameters:
spectrum_id (str) – The ‘spectrumID’ of the mass spectrum from an mzIdentML file read with pyteomics.
- Returns:
The scan number.
- Return type:
int
Examples
>>> from pyXLMS.parser import parse_scan_nr_from_mzid
>>> parse_scan_nr_from_mzid("scan=5321")
5321
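A parser of this kind can be approximated with a regular expression. The sketch below is an assumption about the general approach, not the pyXLMS source; both function names are hypothetical, and the `index=` layout in the second function is an invented example of a spectrumID that would need a custom parser.

```python
import re


def sketch_parse_scan_nr(spectrum_id: str) -> int:
    """Hypothetical stand-in that extracts the number following 'scan='."""
    match = re.search(r"scan=(\d+)", spectrum_id)
    if match is None:
        raise ValueError(f"No scan number found in '{spectrum_id}'.")
    return int(match.group(1))


def sketch_parse_index_as_scan_nr(spectrum_id: str) -> int:
    """Hypothetical custom parser for spectrumIDs of the (assumed) form 'index=42'."""
    match = re.search(r"index=(\d+)", spectrum_id)
    if match is None:
        raise ValueError(f"No index found in '{spectrum_id}'.")
    return int(match.group(1))
```

Any callable with the signature `Callable[[str], int]`, such as the second function, matches the type expected by the scan_nr_parser parameter of read_mzid().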
- pyXLMS.parser_xldbse_mzid.read_mzid(
- files: str | List[str] | BinaryIO,
- scan_nr_parser: Callable[[str], int] | None = None,
- decoy: bool | None = None,
- crosslinkers: Dict[str, float] = {'ADH': 138.09054635, 'BS3': 138.06808, 'DSBSO': 308.03883, 'DSBU': 196.08479231, 'DSS': 138.06808, 'DSSO': 158.00376, 'PhoX': 209.97181},
- verbose: Literal[0, 1, 2] = 1,
)
Read an mzIdentML (mzid) file.
Reads crosslink-spectrum-matches from an mzIdentML (mzid) file and returns a parser_result.
- Parameters:
files (str, list of str, or file stream) – The name/path of the mzIdentML (mzid) file(s) or a file-like object/stream.
scan_nr_parser (callable, or None, default = None) – A function that parses the scan number from mzid spectrumIDs. If None (default) the function parse_scan_nr_from_mzid() is used.
decoy (bool, or None, default = None) – Whether the mzid file contains decoy CSMs (True) or target CSMs (False).
crosslinkers (dict of str, float, default = constants.CROSSLINKERS) – Mapping of crosslinker names to crosslinker delta masses.
verbose (0, 1, or 2, default = 1) –
0: All warnings are ignored.
1: Warnings are printed to stdout.
2: Warnings are treated as errors.
- Returns:
The parser_result object containing all parsed information.
- Return type:
dict
- Raises:
RuntimeError – If the file(s) could not be read or if the file(s) contain no crosslink-spectrum-matches.
RuntimeError – If parser is used with verbose = 2.
RuntimeError – If there are warnings while reading the mzIdentML file (only for verbose = 2).
TypeError – If parameter verbose was not set correctly.
TypeError – If one of the values necessary to create a crosslink-spectrum-match could not be parsed correctly.
Notes
This parser is experimental, as I don’t know if the mzIdentML structure is consistent across different crosslink search engines. This parser was tested with mzIdentML files from MS Annika and XlinkX.
Warning
This parser only parses minimal data because most information is not available from the mzIdentML file. The available data is:
alpha_peptide
alpha_peptide_crosslink_position
beta_peptide
beta_peptide_crosslink_position
spectrum_file
scan_nr
Examples
>>> from pyXLMS.parser import read_mzid
>>> csms = read_mzid("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1.mzid")
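The three verbose levels shared by several parsers in this package can be summarized in a few lines. This is a sketch of the documented behaviour only; `sketch_handle_warning` is a hypothetical helper, and the real parsers may organize this differently.

```python
def sketch_handle_warning(message: str, verbose: int = 1) -> None:
    """Hypothetical helper that applies the documented verbose levels to one warning."""
    if verbose not in (0, 1, 2):
        # Documented as: TypeError if parameter verbose was not set correctly.
        raise TypeError("Verbose level has to be one of 0, 1, or 2!")
    if verbose == 2:
        # Level 2: warnings are treated as errors.
        raise RuntimeError(message)
    if verbose == 1:
        # Level 1: warnings are printed to stdout.
        print(f"Warning: {message}")
    # Level 0: warnings are ignored.
```

Level 2 is useful in automated pipelines where a silently skipped record would be worse than an aborted run.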
pyXLMS.parser_xldbse_plink module#
- pyXLMS.parser_xldbse_plink.parse_scan_nr_from_plink(title: str) → int [source]#
Parse the scan number from a spectrum title.
- Parameters:
title (str) – The spectrum title.
- Returns:
The scan number.
- Return type:
int
Examples
>>> from pyXLMS.parser import parse_scan_nr_from_plink
>>> parse_scan_nr_from_plink("XLpeplib_Beveridge_QEx-HFX_DSS_R1.20588.20588.3.0.dta")
20588
- pyXLMS.parser_xldbse_plink.parse_spectrum_file_from_plink(title: str) → str [source]#
Parse the spectrum file name from a spectrum title.
- Parameters:
title (str) – The spectrum title.
- Returns:
The spectrum file name.
- Return type:
str
Examples
>>> from pyXLMS.parser import parse_spectrum_file_from_plink
>>> parse_spectrum_file_from_plink("XLpeplib_Beveridge_QEx-HFX_DSS_R1.20588.20588.3.0.dta")
'XLpeplib_Beveridge_QEx-HFX_DSS_R1'
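Both title parsers can be understood through one sketch, assuming the spectrum title follows the layout `<spectrum file>.<scan>.<scan>.<charge>.<index>.dta` seen in the examples. That layout is inferred from the single documented title, and `sketch_split_plink_title` is a hypothetical helper rather than pyXLMS code.

```python
from typing import Tuple


def sketch_split_plink_title(title: str) -> Tuple[str, int]:
    """Hypothetical helper returning (spectrum file name, scan number)."""
    parts = title.split(".")
    if len(parts) < 6:
        raise ValueError(f"Unexpected spectrum title layout: '{title}'")
    # Counting from the right keeps file names that contain dots intact.
    spectrum_file = ".".join(parts[:-5])
    scan_nr = int(parts[-5])
    return spectrum_file, scan_nr
```

Titles written in a different layout would need custom callables passed via spectrum_file_parser and scan_nr_parser of read_plink().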
- pyXLMS.parser_xldbse_plink.read_plink(
- files: str | List[str] | BinaryIO,
- spectrum_file_parser: Callable[[str], str] | None = None,
- scan_nr_parser: Callable[[str], int] | None = None,
- decoy_prefix: str = 'REV_',
- parse_modifications: bool = True,
- modifications: Dict[str, float] = {'ADH': 138.09054635, 'Acetyl': 42.010565, 'BS3': 138.06808, 'Carbamidomethyl': 57.021464, 'DSBSO': 308.03883, 'DSBU': 196.08479231, 'DSS': 138.06808, 'DSSO': 158.00376, 'Oxidation': 15.994915, 'PhoX': 209.97181, 'Phospho': 79.966331},
- sep: str = ',',
- decimal: str = '.',
- verbose: Literal[0, 1, 2] = 1,
)
Read a pLink result file.
Reads a pLink crosslink-spectrum-matches result file “*cross-linked_spectra.csv” in .csv (comma delimited) format and returns a parser_result.
- Parameters:
files (str, list of str, or file stream) – The name/path of the pLink result file(s) or a file-like object/stream.
spectrum_file_parser (callable, or None, default = None) – A function that parses the spectrum file name from spectrum titles. If None (default) the function parse_spectrum_file_from_plink() is used.
scan_nr_parser (callable, or None, default = None) – A function that parses the scan number from spectrum titles. If None (default) the function parse_scan_nr_from_plink() is used.
decoy_prefix (str, default = "REV_") – The prefix that indicates that a protein is from the decoy database.
parse_modifications (bool, default = True) – Whether or not post-translational-modifications should be parsed for crosslink-spectrum-matches. Requires correct specification of the ‘modifications’ parameter.
modifications (dict of str, float, default = constants.MODIFICATIONS) – Mapping of modification names to modification masses.
sep (str, default = ",") – Separator used in the .csv file.
decimal (str, default = ".") – Character to recognize as decimal point.
verbose (0, 1, or 2, default = 1) –
0: All warnings are ignored.
1: Warnings are printed to stdout.
2: Warnings are treated as errors.
- Returns:
The parser_result object containing all parsed information.
- Return type:
dict
- Raises:
RuntimeError – If the file(s) could not be read or if the file(s) contain no crosslink-spectrum-matches.
TypeError – If parameter verbose was not set correctly.
Warning
Target and decoy information is derived from the protein accession and parameter decoy_prefix. By default, pLink only reports target matches that are above the desired FDR.
Examples
>>> from pyXLMS.parser import read_plink
>>> csms = read_plink("data/plink2/Cas9_plus10_2024.06.20.filtered_cross-linked_spectra.csv")
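Deriving the target/decoy label from an accession presumably comes down to a prefix check. The sketch below assumes exactly that; `sketch_is_decoy` is a hypothetical helper and the accessions in the usage lines are made up.

```python
def sketch_is_decoy(accession: str, decoy_prefix: str = "REV_") -> bool:
    """Hypothetical helper: a protein counts as decoy if its accession starts with the prefix."""
    return accession.strip().startswith(decoy_prefix)


# Made-up accessions for illustration only.
print(sketch_is_decoy("PROT1"))
print(sketch_is_decoy("REV_PROT1"))
print(sketch_is_decoy("DECOY_PROT1", decoy_prefix="DECOY_"))
```

If the decoy database was built with a different prefix than "REV_", the decoy_prefix parameter has to be adjusted accordingly, otherwise every match would be labelled as target.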
pyXLMS.parser_xldbse_scout module#
- pyXLMS.parser_xldbse_scout.detect_scout_filetype(
- data: DataFrame,
)
Detects the Scout-related source of the data.
Detects whether the input data is unfiltered crosslink-spectrum-matches, filtered crosslink-spectrum-matches, or crosslinks from Scout.
- Parameters:
data (pd.DataFrame) – The input data originating from Scout.
- Returns:
“scout_csms_unfiltered” if a Scout unfiltered CSMs file was read, “scout_csms_filtered” if a Scout filtered CSMs file was read, “scout_xl” if a Scout crosslink/residue pair result file was read.
- Return type:
str
- Raises:
ValueError – If the data source could not be determined.
Examples
>>> from pyXLMS.parser import detect_scout_filetype
>>> import pandas as pd
>>> df1 = pd.read_csv("data/scout/Cas9_Unfiltered_CSMs.csv")
>>> detect_scout_filetype(df1)
'scout_csms_unfiltered'
>>> from pyXLMS.parser import detect_scout_filetype
>>> import pandas as pd
>>> df2 = pd.read_csv("data/scout/Cas9_Filtered_CSMs.csv")
>>> detect_scout_filetype(df2)
'scout_csms_filtered'
>>> from pyXLMS.parser import detect_scout_filetype
>>> import pandas as pd
>>> df3 = pd.read_csv("data/scout/Cas9_Residue_Pairs.csv")
>>> detect_scout_filetype(df3)
'scout_xl'
- pyXLMS.parser_xldbse_scout.parse_modifications_from_scout_sequence(
- seq: str,
- crosslink_position: int,
- crosslinker: str,
- crosslinker_mass: float,
- modifications: Dict[str, Tuple[str, float]] = {'+15.994900': ('Oxidation', 15.994915), '+57.021460': ('Carbamidomethyl', 57.021464), 'ADH': ('ADH', 138.09054635), 'BS3': ('BS3', 138.06808), 'Carbamidomethyl': ('Carbamidomethyl', 57.021464), 'DSBSO': ('DSBSO', 308.03883), 'DSBU': ('DSBU', 196.08479231), 'DSS': ('DSS', 138.06808), 'DSSO': ('DSSO', 158.00376), 'Oxidation of Methionine': ('Oxidation', 15.994915), 'PhoX': ('PhoX', 209.97181)},
- verbose: Literal[0, 1, 2] = 1,
)
Parse post-translational-modifications from a Scout peptide sequence.
Parses post-translational-modifications (PTMs) from a Scout peptide sequence, for example “M(+15.994900)LASAGELQKGNELALPSK”.
- Parameters:
seq (str) – The Scout sequence string.
crosslink_position (int) – Position of the crosslinker in the sequence (1-based).
crosslinker (str) – Name of the used cross-linking reagent, for example “DSSO”.
crosslinker_mass (float) – Monoisotopic delta mass of the crosslink modification.
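modifications (dict of str, tuple, default = constants.SCOUT_MODIFICATION_MAPPING) – Mapping of Scout sequence elements (e.g. "+15.994900") and modifications (e.g. "Oxidation of Methionine") to their modifications (e.g. ("Oxidation", 15.994915)).
verbose (0, 1, or 2, default = 1) –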
0: All warnings are ignored.
1: Warnings are printed to stdout.
2: Warnings are treated as errors.
- Returns:
The pyXLMS specific modifications object, a dictionary that maps positions to their corresponding modifications and their monoisotopic masses.
- Return type:
dict of int, tuple
- Raises:
RuntimeError – If multiple modifications on the same residue are parsed (only if verbose = 2).
KeyError – If an unknown modification is encountered.
Examples
>>> from pyXLMS.parser import parse_modifications_from_scout_sequence
>>> seq = "M(+15.994900)LASAGELQKGNELALPSK"
>>> parse_modifications_from_scout_sequence(seq, 10, "DSS", 138.06808)
{10: ('DSS', 138.06808), 1: ('Oxidation', 15.994915)}
>>> from pyXLMS.parser import parse_modifications_from_scout_sequence
>>> seq = "KIEC(+57.021460)FDSVEISGVEDR"
>>> parse_modifications_from_scout_sequence(seq, 1, "DSS", 138.06808)
{1: ('DSS', 138.06808), 4: ('Carbamidomethyl', 57.021464)}
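How the bracketed mass shifts map to positions can be sketched with a single regular expression. The helper below is a hypothetical illustration that uses two entries of the default mapping shown in the signature; it is not the pyXLMS implementation and deliberately omits the handling of several modifications on one residue.

```python
import re
from typing import Dict, Tuple

# Two entries copied from the default mapping shown in the signature above.
EXAMPLE_MAPPING: Dict[str, Tuple[str, float]] = {
    "+15.994900": ("Oxidation", 15.994915),
    "+57.021460": ("Carbamidomethyl", 57.021464),
}


def sketch_parse_scout_modifications(
    seq: str,
    crosslink_position: int,
    crosslinker: str,
    crosslinker_mass: float,
    mapping: Dict[str, Tuple[str, float]] = EXAMPLE_MAPPING,
) -> Dict[int, Tuple[str, float]]:
    """Hypothetical helper for illustration; not the pyXLMS implementation."""
    parsed = {crosslink_position: (crosslinker, crosslinker_mass)}
    position = 0  # 1-based index of the most recently read residue
    # Tokens are either a bracketed annotation or a single residue character.
    for token in re.findall(r"\([^)]*\)|.", seq):
        if token.startswith("(") and token.endswith(")") and len(token) > 1:
            parsed[position] = mapping[token[1:-1]]  # KeyError for unknown elements
        else:
            position += 1
    return parsed
```

Under these assumptions the sketch reproduces both documented results, since each annotation is attached to the residue that precedes it.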
- pyXLMS.parser_xldbse_scout.read_scout(
- files: str | List[str] | BinaryIO,
- crosslinker: str,
- crosslinker_mass: float | None = None,
- parse_modifications: bool = True,
- modifications: Dict[str, Tuple[str, float]] = {'+15.994900': ('Oxidation', 15.994915), '+57.021460': ('Carbamidomethyl', 57.021464), 'ADH': ('ADH', 138.09054635), 'BS3': ('BS3', 138.06808), 'Carbamidomethyl': ('Carbamidomethyl', 57.021464), 'DSBSO': ('DSBSO', 308.03883), 'DSBU': ('DSBU', 196.08479231), 'DSS': ('DSS', 138.06808), 'DSSO': ('DSSO', 158.00376), 'Oxidation of Methionine': ('Oxidation', 15.994915), 'PhoX': ('PhoX', 209.97181)},
- sep: str = ',',
- decimal: str = '.',
- verbose: Literal[0, 1, 2] = 1,
)
Read a Scout result file.
Reads a Scout filtered or unfiltered crosslink-spectrum-matches result file or crosslink/residue pair result file in .csv format and returns a parser_result.
- Parameters:
files (str, list of str, or file stream) – The name/path of the Scout result file(s) or a file-like object/stream.
crosslinker (str) – Name of the used cross-linking reagent, for example “DSSO”.
crosslinker_mass (float, or None, default = None) – Monoisotopic delta mass of the crosslink modification. If the crosslinker is defined in parameter “modifications” this can be omitted.
parse_modifications (bool, default = True) – Whether or not post-translational-modifications should be parsed for crosslink-spectrum-matches. Requires correct specification of the ‘modifications’ parameter.
modifications (dict of str, tuple, default = constants.SCOUT_MODIFICATION_MAPPING) – Mapping of Scout sequence elements (e.g. "+15.994900") and modifications (e.g. "Oxidation of Methionine") to their modifications (e.g. ("Oxidation", 15.994915)).
sep (str, default = ",") – Separator used in the .csv file.
decimal (str, default = ".") – Character to recognize as decimal point.
verbose (0, 1, or 2, default = 1) –
0: All warnings are ignored.
1: Warnings are printed to stdout.
2: Warnings are treated as errors.
- Returns:
The parser_result object containing all parsed information.
- Return type:
dict
- Raises:
RuntimeError – If the file(s) could not be read or if the file(s) contain no crosslinks or crosslink-spectrum-matches.
KeyError – If the specified crosslinker could not be found/mapped.
TypeError – If parameter verbose was not set correctly.
Warning
When reading unfiltered crosslink-spectrum-matches, no protein crosslink positions or protein peptide positions are available, as these are not reported. If needed they should be annotated with transform.reannotate_positions().
When reading filtered crosslink-spectrum-matches, Scout does not report if the individual peptides in a crosslink are from the target or decoy database. The parser assumes that both peptides from a target crosslink-spectrum-match are from the target database, and vice versa, that both peptides are from the decoy database if it is a decoy crosslink-spectrum-match. This leads to only TT and DD matches, which needs to be considered for FDR estimation.
When reading crosslinks / residue pairs, Scout does not report if the individual peptides in a crosslink are from the target or decoy database. The parser assumes that both peptides from a target crosslink are from the target database, and vice versa, that both peptides are from the decoy database if it is a decoy crosslink. This leads to only TT and DD matches, which needs to be considered for FDR estimation.
Examples
>>> from pyXLMS.parser import read_scout
>>> csms_unfiltered = read_scout("data/scout/Cas9_Unfiltered_CSMs.csv")
>>> from pyXLMS.parser import read_scout
>>> csms_filtered = read_scout("data/scout/Cas9_Filtered_CSMs.csv")
>>> from pyXLMS.parser import read_scout
>>> crosslinks = read_scout("data/scout/Cas9_Residue_Pairs.csv")
pyXLMS.parser_xldbse_xi module#
- pyXLMS.parser_xldbse_xi.detect_xi_filetype(
- data: DataFrame,
)
Detects the xi-related source (application) of the data.
Detects whether the input data is originating from xiSearch or xiFDR, and if xiFDR which type of data is being read (crosslink-spectrum-matches or crosslinks).
- Parameters:
data (pd.DataFrame) – The input data originating from xiSearch or xiFDR.
- Returns:
“xisearch” if a xiSearch result file was read, “xifdr_csms” if CSMs from xiFDR were read, “xifdr_crosslinks” if crosslinks from xiFDR were read.
- Return type:
str
- Raises:
ValueError – If the data source could not be determined.
Examples
>>> from pyXLMS.parser import detect_xi_filetype
>>> import pandas as pd
>>> df1 = pd.read_csv("data/xi/r1_Xi1.7.6.7.csv")
>>> detect_xi_filetype(df1)
'xisearch'
>>> from pyXLMS.parser import detect_xi_filetype
>>> import pandas as pd
>>> df2 = pd.read_csv("data/xi/1perc_xl_boost_CSM_xiFDR2.2.1.csv")
>>> detect_xi_filetype(df2)
'xifdr_csms'
>>> from pyXLMS.parser import detect_xi_filetype
>>> import pandas as pd
>>> df3 = pd.read_csv("data/xi/1perc_xl_boost_Links_xiFDR2.2.1.csv")
>>> detect_xi_filetype(df3)
'xifdr_crosslinks'
- pyXLMS.parser_xldbse_xi.parse_modifications_from_xi_sequence(sequence: str) → Dict[int, str] [source]#
Parses all post-translational-modifications from a peptide sequence as reported by xiFDR.
Parses all post-translational-modifications from a peptide sequence as reported by xiFDR. This assumes that amino acids are given in upper case letters and post-translational-modifications in lower case letters. The parsed modifications are returned as a dictionary that maps their position in the sequence (1-based) to their xiFDR annotation (SYMBOLEXT), for example "cm" or "ox".
- Parameters:
sequence (str) – The peptide sequence as given by xiFDR.
- Returns:
Dictionary that maps positions in the peptide sequence (1-based) (keys) to their respective modifications (values). The modifications are given in xiFDR annotation style (SYMBOLEXT), which is the lowercase modification code, for example "cm" for carbamidomethylation.
- Return type:
dict of int, str
- Raises:
RuntimeError – If multiple modifications on the same residue are parsed.
Examples
>>> from pyXLMS.parser import parse_modifications_from_xi_sequence
>>> seq1 = "KIECcmFDSVEISGVEDR"
>>> parse_modifications_from_xi_sequence(seq1)
{4: 'cm'}
>>> from pyXLMS.parser import parse_modifications_from_xi_sequence
>>> seq2 = "KIECcmFDSVEMoxISGVEDR"
>>> parse_modifications_from_xi_sequence(seq2)
{4: 'cm', 10: 'ox'}
>>> from pyXLMS.parser import parse_modifications_from_xi_sequence
>>> seq3 = "KIECcmFDSVEISGVEDRMox"
>>> parse_modifications_from_xi_sequence(seq3)
{4: 'cm', 17: 'ox'}
>>> from pyXLMS.parser import parse_modifications_from_xi_sequence
>>> seq4 = "CcmKIECcmFDSVEISGVEDRMox"
>>> parse_modifications_from_xi_sequence(seq4)
{1: 'cm', 5: 'cm', 18: 'ox'}
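The upper case/lower case convention makes the position counting straightforward. A minimal sketch of the idea, assuming every character that is not an upper case letter belongs to the annotation of the preceding residue (`sketch_parse_xi_modifications` is a hypothetical name, and the sketch cannot detect multiple modifications on one residue):

```python
from typing import Dict


def sketch_parse_xi_modifications(sequence: str) -> Dict[int, str]:
    """Hypothetical helper for illustration; not the pyXLMS implementation."""
    modifications: Dict[int, str] = {}
    position = 0  # 1-based index of the most recently read residue
    for char in sequence:
        if char.isupper():
            position += 1
        else:
            # Collect the annotation characters on the preceding residue.
            modifications[position] = modifications.get(position, "") + char
    return modifications
```

Only upper case characters advance the counter, which is why "Mox" at the end of "KIECcmFDSVEISGVEDRMox" is reported at position 17 and not 19.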
- pyXLMS.parser_xldbse_xi.parse_peptide(sequence: str, term_char: str = '.') → str [source]#
Parses the peptide sequence from a sequence string including flanking amino acids.
Parses the peptide sequence from a sequence string including flanking amino acids, for example "K.KKMoxKLS.S". The returned peptide sequence for this example would be "KKMoxKLS".
- Parameters:
sequence (str) – The sequence string containing the peptide sequence and flanking amino acids.
term_char (str (single character), default = ".") – The character used to denote N-terminal and C-terminal.
- Returns:
The parsed peptide sequence without flanking amino acids.
- Return type:
str
- Raises:
RuntimeError – If (one of) the peptide sequence(s) could not be parsed.
Examples
>>> from pyXLMS.parser import parse_peptide
>>> parse_peptide("K.KKMoxKLS.S")
'KKMoxKLS'
>>> from pyXLMS.parser import parse_peptide
>>> parse_peptide("-.CcmCcmPSR.T")
'CcmCcmPSR'
>>> from pyXLMS.parser import parse_peptide
>>> parse_peptide("CCPSR")
'CCPSR'
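The three documented cases suggest a split on the terminal character. The following sketch assumes that a valid input has either no terminal character or exactly two; `sketch_parse_peptide` is a hypothetical stand-in, not the pyXLMS code.

```python
def sketch_parse_peptide(sequence: str, term_char: str = ".") -> str:
    """Hypothetical helper that strips flanking amino acids from a sequence string."""
    parts = sequence.strip().split(term_char)
    if len(parts) == 1:
        # No flanking amino acids given, the input already is the peptide.
        return parts[0]
    if len(parts) == 3:
        # Layout: <N-terminal flank>.<peptide>.<C-terminal flank>
        return parts[1]
    raise RuntimeError(f"Could not parse peptide sequence from '{sequence}'.")
```

A string with only one terminal character, such as "K.KKMoxKLS", is ambiguous and is rejected in this sketch.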
- pyXLMS.parser_xldbse_xi.read_xi(
- files: str | List[str] | BinaryIO,
- decoy_prefix: str | None = 'auto',
- parse_modifications: bool = True,
- modifications: Dict[str, Tuple[str, float]] = {'->': ('Substitution', nan), 'bs3_ami': ('BS3 Amidated', 155.094619105), 'bs3_hyd': ('BS3 Hydrolized', 156.0786347), 'bs3_tris': ('BS3 Tris', 259.141973), 'bs3loop': ('BS3 Looplink', 138.06808), 'bs3nh2': ('BS3 Amidated', 155.094619105), 'bs3oh': ('BS3 Hydrolized', 156.0786347), 'cm': ('Carbamidomethyl', 57.021464), 'dsbu_ami': ('DSBU Amidated', 213.111341), 'dsbu_hyd': ('DSBU Hydrolized', 214.095357), 'dsbu_loop': ('DSBU Looplink', 196.08479231), 'dsbu_tris': ('DSBU Tris', 317.158685), 'dsbuloop': ('DSBU Looplink', 196.08479231), 'dsso_ami': ('DSSO Amidated', 175.030313905), 'dsso_hyd': ('DSSO Hydrolized', 176.0143295), 'dsso_loop': ('DSSO Looplink', 158.00376), 'dsso_tris': ('DSSO Tris', 279.077658), 'dssoloop': ('DSSO Looplink', 158.00376), 'ox': ('Oxidation', 15.994915)},
- sep: str = ',',
- decimal: str = '.',
- ignore_errors: bool = False,
- verbose: Literal[0, 1, 2] = 1,
)
Read a xiSearch/xiFDR result file.
Reads a xiSearch crosslink-spectrum-matches result file or a xiFDR crosslink-spectrum-matches result file or crosslink result file in .csv format and returns a parser_result.
- Parameters:
files (str, list of str, or file stream) – The name/path of the xiSearch/xiFDR result file(s) or a file-like object/stream.
decoy_prefix (str, or None, default = "auto") – The prefix that indicates that a protein is from the decoy database. If “auto” or None it will use the default for each xi file type.
parse_modifications (bool, default = True) – Whether or not post-translational-modifications should be parsed for crosslink-spectrum-matches. Requires correct specification of the ‘modifications’ parameter.
modifications (dict of str, tuple, default = constants.XI_MODIFICATION_MAPPING) – Mapping of xi sequence elements (e.g. "cm") to their modifications (e.g. ("Carbamidomethyl", 57.021464)). This corresponds to the SYMBOLEXT field, or the SYMBOL field minus the amino acid in the xiSearch config.
sep (str, default = ",") – Separator used in the .csv file.
decimal (str, default = ".") – Character to recognize as decimal point.
ignore_errors (bool, default = False) – Whether modifications that are not given in parameter ‘modifications’ should raise an error or not. By default an error is raised if an unknown modification is encountered. If True, modifications that are unknown are encoded with the xi shortcode (SYMBOLEXT) and a float("nan") modification mass.
verbose (0, 1, or 2, default = 1) –
0: All warnings are ignored.
1: Warnings are printed to stdout.
2: Warnings are treated as errors.
- Returns:
The parser_result object containing all parsed information.
- Return type:
dict
- Raises:
RuntimeError – If the file(s) contain no crosslinks or crosslink-spectrum-matches.
TypeError – If parameter verbose was not set correctly.
Examples
>>> from pyXLMS.parser import read_xi
>>> csms_from_xiSearch = read_xi("data/xi/r1_Xi1.7.6.7.csv")
>>> from pyXLMS.parser import read_xi
>>> csms_from_xiFDR = read_xi("data/xi/1perc_xl_boost_CSM_xiFDR2.2.1.csv")
>>> from pyXLMS.parser import read_xi
>>> crosslinks_from_xiFDR = read_xi("data/xi/1perc_xl_boost_Links_xiFDR2.2.1.csv")
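The effect of ignore_errors on unknown xi shortcodes can be sketched as follows. The helper `sketch_map_xi_modification` is hypothetical, the mapping is a two-entry excerpt of the default shown in the signature, and the use of KeyError for the unknown case is an assumption, since the documentation above does not name the exception type.

```python
import math
from typing import Dict, Tuple

# Two entries copied from the default mapping shown in the signature above.
EXAMPLE_XI_MAPPING: Dict[str, Tuple[str, float]] = {
    "cm": ("Carbamidomethyl", 57.021464),
    "ox": ("Oxidation", 15.994915),
}


def sketch_map_xi_modification(
    symbol: str,
    mapping: Dict[str, Tuple[str, float]] = EXAMPLE_XI_MAPPING,
    ignore_errors: bool = False,
) -> Tuple[str, float]:
    """Hypothetical helper for illustration; not the pyXLMS implementation."""
    if symbol in mapping:
        return mapping[symbol]
    if ignore_errors:
        # Unknown modifications keep their xi shortcode and get a NaN mass.
        return (symbol, float("nan"))
    raise KeyError(f"Unknown xi modification: {symbol}")  # exception type is assumed


name, mass = sketch_map_xi_modification("xyz", ignore_errors=True)
print(name, math.isnan(mass))  # prints: xyz True
```

Because NaN never compares equal to itself, downstream code should test such masses with `math.isnan()` rather than with `==`.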
pyXLMS.parser_xldbse_xlinkx module#
- pyXLMS.parser_xldbse_xlinkx.read_xlinkx(
- files: str | List[str] | BinaryIO,
- decoy: bool | None = None,
- parse_modifications: bool = True,
- modifications: Dict[str, float] = {'ADH': 138.09054635, 'Acetyl': 42.010565, 'BS3': 138.06808, 'Carbamidomethyl': 57.021464, 'DSBSO': 308.03883, 'DSBU': 196.08479231, 'DSS': 138.06808, 'DSSO': 158.00376, 'Oxidation': 15.994915, 'PhoX': 209.97181, 'Phospho': 79.966331},
- format: Literal['auto', 'csv', 'txt', 'tsv', 'xlsx', 'pdresult'] = 'auto',
- sep: str = '\t',
- decimal: str = '.',
- ignore_errors: bool = False,
- verbose: Literal[0, 1, 2] = 1,
)
Read an XlinkX result file.
Reads an XlinkX crosslink-spectrum-matches result file or crosslink result file in .csv or .xlsx format, or both from a .pdResult file from Proteome Discoverer, and returns a parser_result.
- Parameters:
files (str, list of str, or file stream) – The name/path of the XlinkX result file(s) or a file-like object/stream.
decoy (bool, or None) – Default decoy value to use if no decoy value is found. Only used if the “Is Decoy” column is not found in the supplied data.
parse_modifications (bool, default = True) – Whether or not post-translational-modifications should be parsed for crosslink-spectrum-matches. Requires correct specification of the ‘modifications’ parameter.
modifications (dict of str, float, default = constants.MODIFICATIONS) – Mapping of modification names to modification masses.
format ("auto", "csv", "tsv", "txt", "xlsx", or "pdresult", default = "auto") – The format of the result file. "auto" is only available if the name/path to the XlinkX result file is given.
sep (str, default = "\t") – Separator used in the .csv or .tsv file. Parameter is ignored if the file is in .xlsx or .pdResult format.
decimal (str, default = ".") – Character to recognize as decimal point. Parameter is ignored if the file is in .xlsx or .pdResult format.
ignore_errors (bool, default = False) – Whether missing crosslink positions should raise an error or not. Setting this to True will suppress the RuntimeError that is raised when the crosslink position cannot be parsed for at least one of the crosslinks. In these cases the crosslink position will be set to 100 000.
verbose (0, 1, or 2, default = 1) –
0: All warnings are ignored.
1: Warnings are printed to stdout.
2: Warnings are treated as errors.
- Returns:
The parser_result object containing all parsed information.
- Return type:
dict
- Raises:
ValueError – If the input format is not supported or cannot be inferred.
TypeError – If parameter verbose was not set correctly.
TypeError – If the pdResult file is provided in the wrong format.
RuntimeError – If the crosslink position could not be parsed for at least one of the crosslinks.
RuntimeError – If the file(s) could not be read or if the file(s) contain no crosslinks or crosslink-spectrum-matches.
KeyError – If one of the found post-translational-modifications could not be found/mapped.
Warning
XlinkX does not report if the individual peptides in a crosslink are from the target or decoy database. The parser assumes that both peptides from a target crosslink are from the target database, and vice versa, that both peptides are from the decoy database if it is a decoy crosslink. This leads to only TT and DD matches, which needs to be considered for FDR estimation. This applies to both crosslinks and crosslink-spectrum-matches.
Examples
>>> from pyXLMS.parser import read_xlinkx
>>> csms_from_xlsx = read_xlinkx("data/xlinkx/XLpeplib_Beveridge_Lumos_DSSO_MS3_CSMs.xlsx")

>>> from pyXLMS.parser import read_xlinkx
>>> crosslinks_from_xlsx = read_xlinkx("data/xlinkx/XLpeplib_Beveridge_Lumos_DSSO_MS3_Crosslinks.xlsx")

>>> from pyXLMS.parser import read_xlinkx
>>> csms_from_tsv = read_xlinkx("data/xlinkx/XLpeplib_Beveridge_Lumos_DSSO_MS3_CSMs.txt")

>>> from pyXLMS.parser import read_xlinkx
>>> crosslinks_from_tsv = read_xlinkx("data/xlinkx/XLpeplib_Beveridge_Lumos_DSSO_MS3_Crosslinks.txt")

>>> from pyXLMS.parser import read_xlinkx
>>> csms_and_crosslinks_from_pdresult = read_xlinkx("data/xlinkx/XLpeplib_Beveridge_Lumos_DSSO_MS3.pdResult")
pyXLMS.pipelines module#
- pyXLMS.pipelines.pipeline(
- files: str | List[str] | BinaryIO,
- engine: Literal['Custom', 'MaxQuant', 'MaxLynx', 'MS Annika', 'mzIdentML', 'pLink', 'Scout', 'xiSearch/xiFDR', 'XlinkX'],
- crosslinker: str,
- unique: bool | Dict[str, Any] | None = True,
- validate: bool | Dict[str, Any] | None = True,
- targets_only: bool | None = True,
- **kwargs,
Runs a standard down-stream analysis pipeline for crosslinks and crosslink-spectrum-matches.
Runs a standard down-stream analysis pipeline for crosslinks and crosslink-spectrum-matches. The pipeline first reads a result file. It then optionally filters the read data for unique crosslinks and crosslink-spectrum-matches, optionally validates the data by false discovery rate estimation, and optionally returns only target-target matches. Internally the pipeline calls parser.read(), transform.unique(), transform.validate(), and transform.targets_only().
- Parameters:
files (str, list of str, or file stream) – The name/path of the result file(s) or a file-like object/stream.
engine ("Custom", "MaxQuant", "MaxLynx", "MS Annika", "mzIdentML", "pLink", "Scout", "xiSearch/xiFDR", or "XlinkX") – Crosslink search engine or format of the result file.
crosslinker (str) – Name of the used cross-linking reagent, for example “DSSO”.
unique (dict of str, any, or bool, or None, default = True) – Whether transform.unique() should be run in the pipeline. If None or False this step is omitted. If True this step is run with default parameters. If a dictionary is given it should contain parameters for running transform.unique(). A parameter omitted from the dictionary falls back to its default value.
validate (dict of str, any, or bool, or None, default = True) – Whether transform.validate() should be run in the pipeline. If None or False this step is omitted. If True this step is run with default parameters. If a dictionary is given it should contain parameters for running transform.validate(). A parameter omitted from the dictionary falls back to its default value.
targets_only (bool, or None, default = True) – Whether transform.targets_only() should be run in the pipeline. If None or False this step is omitted.
**kwargs – Any additional parameters will be passed to the specific result file parsers.
- Returns:
The transformed parser_result after all pipeline steps are completed.
- Return type:
dict of str, any
- Raises:
TypeError – If any of the parameters do not have the correct type.
Notes
Various helpful pipeline information is also printed to stdout.
Examples
>>> from pyXLMS.pipelines import pipeline
>>> pr = pipeline("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_CSMs.xlsx",
...               engine="MS Annika",
...               crosslinker="DSS",
...               unique=True,
...               validate={"fdr": 0.05, "formula":"(TD-DD)/TT"},
...               targets_only=True)
Reading MS Annika CSMs...: 100%|██████████████████████████████████████████████████| 826/826 [00:00<00:00, 10337.98it/s]
---- Summary statistics before pipeline ----
Number of CSMs: 826.0
Number of unique CSMs: 826.0
Number of intra CSMs: 803.0
Number of inter CSMs: 23.0
Number of target-target CSMs: 786.0
Number of target-decoy CSMs: 39.0
Number of decoy-decoy CSMs: 1.0
Minimum CSM score: 1.11
Maximum CSM score: 452.99
Iterating over scores for FDR calculation...: 0%| | 0/826 [00:00<?, ?it/s]
---- Summary statistics after pipeline ----
Number of CSMs: 786.0
Number of unique CSMs: 786.0
Number of intra CSMs: 774.0
Number of inter CSMs: 12.0
Number of target-target CSMs: 786.0
Number of target-decoy CSMs: 0.0
Number of decoy-decoy CSMs: 0.0
Minimum CSM score: 1.28
Maximum CSM score: 452.99
---- Performed pipeline steps ----
:: parser.read() ::
:: parser.read() :: params :: <params omitted>
:: transform.unique() ::
:: transform.unique() :: params :: by=peptide
:: transform.unique() :: params :: score=higher_better
:: transform.validate() ::
:: transform.validate() :: params :: fdr=0.05
:: transform.validate() :: params :: formula=(TD-DD)/TT
:: transform.validate() :: params :: score=higher_better
:: transform.validate() :: params :: separate_intra_inter=False
:: transform.validate() :: params :: ignore_missing_labels=False
:: transform.targets_only() ::
:: transform.targets_only() :: params :: no params
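The way the unique and validate options are interpreted (None or False skips the step, True runs it with defaults, a dictionary overrides individual defaults) can be sketched in plain Python. This is only an illustration of the documented behaviour, not the pyXLMS implementation: the helper resolve_step_params and the UNIQUE_DEFAULTS dictionary are hypothetical names, and the default values are the ones shown for transform.unique() in this reference.

```python
from typing import Any, Dict, Optional, Union

# Defaults of transform.unique() as documented in its signature.
UNIQUE_DEFAULTS: Dict[str, Any] = {"by": "peptide", "score": "higher_better"}


def resolve_step_params(
    option: Union[bool, Dict[str, Any], None],
    defaults: Dict[str, Any],
) -> Optional[Dict[str, Any]]:
    """Return the parameters for a pipeline step, or None if the step is skipped.

    Hypothetical helper for illustration, not part of pyXLMS.
    """
    if option is None or option is False:
        return None  # the step is omitted
    if option is True:
        return dict(defaults)  # the step runs with default parameters
    if isinstance(option, dict):
        # user-supplied values win, missing ones fall back to the defaults
        return {**defaults, **option}
    raise TypeError("option must be a bool, a dict, or None")


print(resolve_step_params(True, UNIQUE_DEFAULTS))
print(resolve_step_params({"by": "protein"}, UNIQUE_DEFAULTS))
print(resolve_step_params(False, UNIQUE_DEFAULTS))
```

Passing unique={"by": "protein"} to the pipeline would therefore still use score="higher_better", since that parameter is not overridden.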
pyXLMS.transform module#
pyXLMS.transform_aggregate module#
- pyXLMS.transform_aggregate.aggregate(
- csms: List[Dict[str, Any]],
- by: Literal['peptide', 'protein'] = 'peptide',
- score: Literal['higher_better', 'lower_better'] = 'higher_better',
Aggregate crosslink-spectrum-matches to crosslinks.
Aggregates a list of crosslink-spectrum-matches to unique crosslinks. If by = "peptide", a crosslink is considered unique if there is no other crosslink with the same peptide sequence and crosslink position; otherwise it is considered unique if there are no other crosslinks with the same protein crosslink position (residue pair). If more than one crosslink exists per peptide sequence/residue pair, the one with the best score is kept and the rest are filtered out. If crosslink-spectrum-matches without scores are provided, the crosslink of the first corresponding crosslink-spectrum-match in the list is kept instead.
- Parameters:
csms (list of dict of str, any) – A list of crosslink-spectrum-matches.
by (str, one of "peptide" or "protein", default = "peptide") – If peptide or protein crosslink position should be used for determining if a crosslink is unique. If protein crosslink position is not available for all crosslink-spectrum-matches a ValueError will be raised. Make sure that all crosslink-spectrum-matches have the _proteins and _proteins_crosslink_positions fields set. If this is not already done by the parser, this can be achieved with transform.reannotate_positions().
score (str, one of "higher_better" or "lower_better", default = "higher_better") – If a higher score is considered better, or a lower score is considered better.
- Returns:
A list of aggregated, unique crosslinks.
- Return type:
list of dict of str, any
Warning
Aggregation will not conserve false discovery rate (FDR)! Aggregating crosslink-spectrum-matches that are validated for 1% FDR will not result in crosslinks validated for 1% FDR! Aggregated crosslinks should be validated with either external tools or with the built-in transform.validate()!
- Raises:
TypeError – If a wrong data type is provided.
TypeError – If parameter by is not one of ‘peptide’ or ‘protein’.
TypeError – If parameter score is not one of ‘higher_better’ or ‘lower_better’.
ValueError – If parameter by is set to ‘protein’ but protein crosslink positions are not available.
Examples
>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import aggregate
>>> pr = read("data/_test/aggregate/csms.txt", engine="custom", crosslinker="DSS")
>>> len(pr["crosslink-spectrum-matches"])
10
>>> aggregate_peptide = aggregate(pr["crosslink-spectrum-matches"], by="peptide")
>>> len(aggregate_peptide)
3

>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import aggregate
>>> pr = read("data/_test/aggregate/csms.txt", engine="custom", crosslinker="DSS")
>>> len(pr["crosslink-spectrum-matches"])
10
>>> aggregate_protein = aggregate(pr["crosslink-spectrum-matches"], by="protein")
>>> len(aggregate_protein)
2
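The "keep the best-scoring entry per peptide pair" rule described for by = "peptide" can be sketched with plain dictionaries. This is a simplified illustration under stated assumptions, not the pyXLMS implementation: keep_best and peptide_key are hypothetical helpers, the dictionaries only carry the fields needed here, and the sketch returns the selected entries unchanged rather than converting them to crosslink objects.

```python
from typing import Any, Dict, List, Tuple


def peptide_key(csm: Dict[str, Any]) -> Tuple[Any, ...]:
    """Identify an entry by its peptide sequences and crosslink positions."""
    return (
        csm["alpha_peptide"],
        csm["alpha_peptide_crosslink_position"],
        csm["beta_peptide"],
        csm["beta_peptide_crosslink_position"],
    )


def keep_best(csms: List[Dict[str, Any]], score: str = "higher_better") -> List[Dict[str, Any]]:
    """Keep one entry per key: the best-scoring one, or the first if scores are missing."""
    best: Dict[Tuple[Any, ...], Dict[str, Any]] = {}
    for csm in csms:
        key = peptide_key(csm)
        current = best.get(key)
        if current is None:
            best[key] = csm  # first entry seen for this key
            continue
        if csm["score"] is None or current["score"] is None:
            continue  # without scores the first entry is kept
        if score == "higher_better":
            is_better = csm["score"] > current["score"]
        else:
            is_better = csm["score"] < current["score"]
        if is_better:
            best[key] = csm
    return list(best.values())


csms = [
    {"alpha_peptide": "PEPK", "alpha_peptide_crosslink_position": 4,
     "beta_peptide": "KPEP", "beta_peptide_crosslink_position": 1, "score": 10.0},
    {"alpha_peptide": "PEPK", "alpha_peptide_crosslink_position": 4,
     "beta_peptide": "KPEP", "beta_peptide_crosslink_position": 1, "score": 25.0},
    {"alpha_peptide": "PEPK", "alpha_peptide_crosslink_position": 4,
     "beta_peptide": "PKEP", "beta_peptide_crosslink_position": 2, "score": 5.0},
]
kept = keep_best(csms)
print(len(kept), kept[0]["score"])
```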
- pyXLMS.transform_aggregate.unique(
- data: List[Dict[str, Any]] | Dict[str, Any],
- by: Literal['peptide', 'protein'] = 'peptide',
- score: Literal['higher_better', 'lower_better'] = 'higher_better',
Filter for unique crosslinks or crosslink-spectrum-matches.
Filters for unique crosslinks from a list of non-unique crosslinks. If by = "peptide", a crosslink is considered unique if there is no other crosslink with the same peptide sequence and crosslink position; otherwise it is considered unique if there are no other crosslinks with the same protein crosslink position (residue pair). If more than one crosslink exists per peptide sequence/residue pair, the one with the best score is kept and the rest are filtered out. If crosslinks without scores are provided, the first crosslink in the list is kept instead.
or
Filters for unique crosslink-spectrum-matches from a list of non-unique crosslink-spectrum-matches. A crosslink-spectrum-match is considered unique if there is no other crosslink-spectrum-match from the same spectrum file and with the same scan number. If more than one crosslink-spectrum-match exists per spectrum file and scan number, the one with the best score is kept and the rest are filtered out. If crosslink-spectrum-matches without scores are provided, the first crosslink-spectrum-match in the list is kept instead.
- Parameters:
data (dict of str, any, or list of dict of str, any) – A list of crosslink-spectrum-matches or crosslinks to filter, or a parser_result.
by (str, one of "peptide" or "protein", default = "peptide") – If peptide or protein crosslink position should be used for determining if a crosslink is unique. Only affects filtering for unique crosslinks and not crosslink-spectrum-matches. If protein crosslink position is not available for all crosslinks a ValueError will be raised. Make sure that all crosslinks have the _proteins and _proteins_crosslink_positions fields set. If this is not already done by the parser, this can be achieved with transform.reannotate_positions().
score (str, one of "higher_better" or "lower_better", default = "higher_better") – If a higher score is considered better, or a lower score is considered better.
- Returns:
If a list of crosslink-spectrum-matches or crosslinks was provided, a list of unique crosslink-spectrum-matches or crosslinks is returned. If a parser_result was provided, a parser_result with unique crosslink-spectrum-matches and/or unique crosslinks will be returned.
- Return type:
list of dict of str, any, or dict of str, any
- Raises:
TypeError – If a wrong data type is provided.
TypeError – If parameter by is not one of ‘peptide’ or ‘protein’.
TypeError – If parameter score is not one of ‘higher_better’ or ‘lower_better’.
ValueError – If parameter by is set to ‘protein’ but protein crosslink positions are not available.
Examples
>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import unique
>>> pr = read(["data/_test/aggregate/csms.txt", "data/_test/aggregate/xls.txt"], engine="custom", crosslinker="DSS")
>>> len(pr["crosslink-spectrum-matches"])
10
>>> len(pr["crosslinks"])
10
>>> unique_peptide = unique(pr, by="peptide")
>>> len(unique_peptide["crosslink-spectrum-matches"])
5
>>> len(unique_peptide["crosslinks"])
3

>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import unique
>>> pr = read(["data/_test/aggregate/csms.txt", "data/_test/aggregate/xls.txt"], engine="custom", crosslinker="DSS")
>>> len(pr["crosslink-spectrum-matches"])
10
>>> len(pr["crosslinks"])
10
>>> unique_protein = unique(pr, by="protein")
>>> len(unique_protein["crosslink-spectrum-matches"])
5
>>> len(unique_protein["crosslinks"])
2
pyXLMS.transform_filter module#
- pyXLMS.transform_filter.filter_crosslink_type(
- data: List[Dict[str, Any]],
Separate crosslinks and crosslink-spectrum-matches by their crosslink type.
Gets all crosslinks or crosslink-spectrum-matches depending on crosslink type. Will separate based on if a crosslink or crosslink-spectrum-match is of type “intra” or “inter” crosslink.
- Parameters:
data (list of dict of str, any) – A list of pyXLMS crosslinks or crosslink-spectrum-matches.
- Returns:
Returns a dictionary with key Intra which contains all crosslinks or crosslink-spectrum-matches with crosslink type = “intra”, and key Inter which contains all crosslinks or crosslink-spectrum-matches with crosslink type = “inter”.
- Return type:
dict of str, list of dict
- Raises:
TypeError – If an unsupported data type is provided.
Examples
>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import filter_crosslink_type
>>> result = read("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_CSMs.xlsx", engine="MS Annika", crosslinker="DSS")
>>> crosslink_type_filtered_csms = filter_crosslink_type(result["crosslink-spectrum-matches"])
>>> len(crosslink_type_filtered_csms["Intra"])
803
>>> len(crosslink_type_filtered_csms["Inter"])
23

>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import filter_crosslink_type
>>> result = read("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_Crosslinks.xlsx", engine="MS Annika", crosslinker="DSS")
>>> crosslink_type_filtered_crosslinks = filter_crosslink_type(result["crosslinks"])
>>> len(crosslink_type_filtered_crosslinks["Intra"])
279
>>> len(crosslink_type_filtered_crosslinks["Inter"])
21
- pyXLMS.transform_filter.filter_proteins(
- data: List[Dict[str, Any]],
- proteins: Set[str] | List[str],
Get all crosslinks or crosslink-spectrum-matches originating from proteins of interest.
Gets all crosslinks or crosslink-spectrum-matches originating from a list of proteins of interest and returns a list of crosslinks or crosslink-spectrum-matches where both peptides come from a protein of interest and a list of crosslinks or crosslink-spectrum-matches where one of the peptides comes from a protein of interest.
- Parameters:
data (list of dict of str, any) – A list of pyXLMS crosslinks or crosslink-spectrum-matches.
proteins (set of str, or list of str) – A set of protein accessions of interest.
- Returns:
Returns a dictionary with key Proteins which contains the list of proteins of interest, key Both which contains all crosslinks or crosslink-spectrum-matches where both peptides originate from a protein of interest, and key One which contains all crosslinks or crosslink-spectrum-matches where one of the two peptides originates from a protein of interest.
- Return type:
dict
- Raises:
TypeError – If an unsupported data type is provided.
Examples
>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import filter_proteins
>>> result = read("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_CSMs.xlsx", engine="MS Annika", crosslinker="DSS")
>>> proteins_csms = filter_proteins(result["crosslink-spectrum-matches"], ["Cas9"])
>>> proteins_csms["Proteins"]
['Cas9']
>>> len(proteins_csms["Both"])
798
>>> len(proteins_csms["One"])
23

>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import filter_proteins
>>> result = read("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_Crosslinks.xlsx", engine="MS Annika", crosslinker="DSS")
>>> proteins_xls = filter_proteins(result["crosslinks"], ["Cas9"])
>>> proteins_xls["Proteins"]
['Cas9']
>>> len(proteins_xls["Both"])
274
>>> len(proteins_xls["One"])
21
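The split into Both and One can be pictured with a small stand-alone sketch. It is not the pyXLMS implementation: split_by_proteins is a hypothetical helper, the input dictionaries only carry the alpha_proteins and beta_proteins fields, and the sketch assumes that One means exactly one of the two peptides maps to a protein of interest.

```python
from typing import Any, Dict, Iterable, List


def split_by_proteins(data: List[Dict[str, Any]], proteins: Iterable[str]) -> Dict[str, Any]:
    """Group entries by how many of their peptides map to a protein of interest."""
    proteins = list(proteins)
    wanted = set(proteins)
    result: Dict[str, Any] = {"Proteins": proteins, "Both": [], "One": []}
    for item in data:
        # a peptide counts as a hit if any of its proteins is of interest
        alpha_hit = bool(wanted.intersection(item.get("alpha_proteins") or []))
        beta_hit = bool(wanted.intersection(item.get("beta_proteins") or []))
        if alpha_hit and beta_hit:
            result["Both"].append(item)
        elif alpha_hit or beta_hit:
            result["One"].append(item)  # assumption: exactly one peptide matches
    return result


example = [
    {"alpha_proteins": ["Cas9"], "beta_proteins": ["Cas9"]},
    {"alpha_proteins": ["Cas9"], "beta_proteins": ["P12345"]},
    {"alpha_proteins": ["P12345"], "beta_proteins": ["P12345"]},
]
groups = split_by_proteins(example, ["Cas9"])
print(len(groups["Both"]), len(groups["One"]))
```

Entries where neither peptide maps to a protein of interest appear in neither list, which is consistent with the example counts above not adding up to the total number of entries.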
- pyXLMS.transform_filter.filter_target_decoy(
- data: List[Dict[str, Any]],
Separate crosslinks or crosslink-spectrum-matches based on target and decoy matches.
Separates crosslinks or crosslink-spectrum-matches based on whether both peptides match to the target database, both match to the decoy database, or one of them matches to the target database and the other to the decoy database. The first we denote as “Target-Target” or “TT” matches, the second as “Decoy-Decoy” or “DD” matches, and the third as “Target-Decoy” or “TD” matches.
- Parameters:
data (list of dict of str, any) – A list of pyXLMS crosslinks or crosslink-spectrum-matches.
- Returns:
Returns a dictionary with key Target-Target which contains all TT matches, key Target-Decoy which contains all TD matches, and key Decoy-Decoy which contains all DD matches.
- Return type:
dict
- Raises:
TypeError – If an unsupported data type is provided.
Examples
>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import filter_target_decoy
>>> result = read("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_CSMs.xlsx", engine="MS Annika", crosslinker="DSS")
>>> target_and_decoys = filter_target_decoy(result["crosslink-spectrum-matches"])
>>> len(target_and_decoys["Target-Target"])
786
>>> len(target_and_decoys["Target-Decoy"])
39
>>> len(target_and_decoys["Decoy-Decoy"])
1

>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import filter_target_decoy
>>> result = read("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_Crosslinks.xlsx", engine="MS Annika", crosslinker="DSS")
>>> target_and_decoys = filter_target_decoy(result["crosslinks"])
>>> len(target_and_decoys["Target-Target"])
265
>>> len(target_and_decoys["Target-Decoy"])
0
>>> len(target_and_decoys["Decoy-Decoy"])
35
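The TT/TD/DD classification follows directly from the two decoy flags of an entry. The sketch below illustrates that rule with plain dictionaries; split_target_decoy is a hypothetical helper rather than the pyXLMS implementation, and it assumes that alpha_decoy and beta_decoy are both set to a boolean.

```python
from typing import Any, Dict, List

LABELS = ("Target-Target", "Target-Decoy", "Decoy-Decoy")


def split_target_decoy(data: List[Dict[str, Any]]) -> Dict[str, List[Dict[str, Any]]]:
    """Group entries by the number of peptides that match the decoy database."""
    result: Dict[str, List[Dict[str, Any]]] = {label: [] for label in LABELS}
    for item in data:
        # 0 decoy peptides -> TT, 1 -> TD, 2 -> DD
        n_decoys = int(bool(item["alpha_decoy"])) + int(bool(item["beta_decoy"]))
        result[LABELS[n_decoys]].append(item)
    return result


matches = [
    {"alpha_decoy": False, "beta_decoy": False},
    {"alpha_decoy": True, "beta_decoy": False},
    {"alpha_decoy": False, "beta_decoy": True},
    {"alpha_decoy": True, "beta_decoy": True},
]
groups = split_target_decoy(matches)
print({label: len(items) for label, items in groups.items()})
```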
pyXLMS.transform_reannotate_positions module#
- pyXLMS.transform_reannotate_positions.fasta_title_to_accession(title: str) str [source]#
Parses the protein accession from a UniProt-like title.
- Parameters:
title (str) – Fasta title/header.
- Returns:
The protein accession parsed from the title. If parsing was unsuccessful the full title is returned.
- Return type:
str
Examples
>>> from pyXLMS.transform import fasta_title_to_accession
>>> title = "sp|A0A087X1C5|CP2D7_HUMAN Putative cytochrome P450 2D7 OS=Homo sapiens OX=9606 GN=CYP2D7 PE=5 SV=1"
>>> fasta_title_to_accession(title)
'A0A087X1C5'

>>> from pyXLMS.transform import fasta_title_to_accession
>>> title = "Cas9"
>>> fasta_title_to_accession(title)
'Cas9'
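A custom parser with the same contract (return the accession, or the full title if parsing fails) can be written in a few lines. The function below is a hypothetical sketch, not the library's own parsing logic; it assumes the accession is the second pipe-separated field of a UniProt-like title.

```python
def title_to_accession_sketch(title: str) -> str:
    """Return the second '|'-separated field of a fasta title, else the full title."""
    parts = title.split("|")
    # UniProt-like titles look like "db|ACCESSION|ENTRY_NAME description"
    if len(parts) >= 3 and parts[1].strip():
        return parts[1].strip()
    return title


uniprot_title = (
    "sp|A0A087X1C5|CP2D7_HUMAN Putative cytochrome P450 2D7 "
    "OS=Homo sapiens OX=9606 GN=CYP2D7 PE=5 SV=1"
)
print(title_to_accession_sketch(uniprot_title))
print(title_to_accession_sketch("Cas9"))
```

Any callable that takes a title string and returns a string, such as this one, can be supplied as the title_to_accession parameter of reannotate_positions().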
- pyXLMS.transform_reannotate_positions.reannotate_positions(
- data: List[Dict[str, Any]] | Dict[str, Any],
- fasta: str | BinaryIO,
- title_to_accession: Callable[[str], str] | None = None,
Reannotates protein crosslink positions for a given fasta file.
Reannotates the protein crosslink positions and peptide positions of the given cross-linked peptide pairs based on the specified fasta file. Takes a list of crosslink-spectrum-matches or crosslinks, or a parser_result as input.
- Parameters:
data (list of dict of str, any, or dict of str, any) – A list of crosslink-spectrum-matches or crosslinks to annotate, or a parser_result.
fasta (str, or file stream) – The name/path of the fasta file containing protein sequences or a file-like object/stream.
title_to_accession (callable, or None, default = None) – A function that parses the protein accession from the fasta title/header. If None (default) the function fasta_title_to_accession is used.
- Returns:
If a list of crosslink-spectrum-matches or crosslinks was provided, a list of annotated crosslink-spectrum-matches or crosslinks is returned. If a parser_result was provided, an annotated parser_result will be returned.
- Return type:
list of dict of str, any, or dict of str, any
- Raises:
TypeError – If a wrong data type is provided.
Examples
>>> from pyXLMS.data import create_crosslink_min
>>> from pyXLMS.transform import reannotate_positions
>>> xls = [create_crosslink_min("ADANLDK", 7, "GNTDRHSIK", 9)]
>>> xls = reannotate_positions(xls, "data/_fasta/Cas9_plus10.fasta")
>>> xls[0]["alpha_proteins"]
["Cas9"]
>>> xls[0]["alpha_proteins_crosslink_positions"]
[1293]
>>> xls[0]["beta_proteins"]
["Cas9"]
>>> xls[0]["beta_proteins_crosslink_positions"]
[48]
pyXLMS.transform_summary module#
- pyXLMS.transform_summary.summary(
- data: List[Dict[str, Any]] | Dict[str, Any],
Extracts summary stats from a list of crosslinks or crosslink-spectrum-matches, or a parser_result.
Extracts summary statistics from a list of crosslinks or crosslink-spectrum-matches, or a parser_result. The statistics depend on the supplied data type. If a list of crosslinks is supplied, a dictionary with the following statistics and keys is returned:
Number of crosslinks
Number of unique crosslinks by peptide
Number of unique crosslinks by protein
Number of intra crosslinks
Number of inter crosslinks
Number of target-target crosslinks
Number of target-decoy crosslinks
Number of decoy-decoy crosslinks
Minimum crosslink score
Maximum crosslink score
If a list of crosslink-spectrum-matches is supplied, a dictionary with the following statistics and keys is returned:
Number of CSMs
Number of unique CSMs
Number of intra CSMs
Number of inter CSMs
Number of target-target CSMs
Number of target-decoy CSMs
Number of decoy-decoy CSMs
Minimum CSM score
Maximum CSM score
If a parser_result is supplied, a dictionary containing all available statistics is returned. A parser_result that only contains crosslinks will only yield a dictionary with crosslink statistics, and vice versa a parser_result that only contains crosslink-spectrum-matches will only yield a dictionary with crosslink-spectrum-match statistics. If the parser_result contains both, then both dictionaries will be merged and returned. Please note that in this case a single dictionary is returned that contains both the keys for crosslinks and crosslink-spectrum-matches.
Statistics are also printed to stdout.
- Parameters:
data (list of dict of str, any, or dict of str, any) – A list of crosslinks or crosslink-spectrum-matches, or a parser_result.
- Returns:
A dictionary with summary statistics.
- Return type:
dict of str, float
Examples
>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import summary
>>> pr = read("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_CSMs.xlsx", engine="MS Annika", crosslinker="DSS")
>>> csms = pr["crosslink-spectrum-matches"]
>>> stats = summary(csms)
Number of CSMs: 826.0
Number of unique CSMs: 826.0
Number of intra CSMs: 803.0
Number of inter CSMs: 23.0
Number of target-target CSMs: 786.0
Number of target-decoy CSMs: 39.0
Number of decoy-decoy CSMs: 1.0
Minimum CSM score: 1.11
Maximum CSM score: 452.99

>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import summary
>>> pr = read("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_Crosslinks.xlsx", engine="MS Annika", crosslinker="DSS")
>>> stats = summary(pr)
Number of crosslinks: 300.0
Number of unique crosslinks by peptide: 300.0
Number of unique crosslinks by protein: 298.0
Number of intra crosslinks: 279.0
Number of inter crosslinks: 21.0
Number of target-target crosslinks: 265.0
Number of target-decoy crosslinks: 0.0
Number of decoy-decoy crosslinks: 35.0
Minimum crosslink score: 1.11
Maximum crosslink score: 452.99
pyXLMS.transform_targets_only module#
- pyXLMS.transform_targets_only.targets_only(
- data: List[Dict[str, Any]] | Dict[str, Any],
Get target crosslinks or crosslink-spectrum-matches.
Get target crosslinks or crosslink-spectrum-matches from a list of target and decoy crosslinks or crosslink-spectrum-matches, or a parser_result. This effectively filters out any target-decoy and decoy-decoy matches and is essentially a convenience wrapper for transform.filter_target_decoy()["Target-Target"].
- Parameters:
data (dict of str, any, or list of dict of str, any) – A list of crosslink-spectrum-matches or crosslinks, or a parser_result.
- Returns:
If a list of crosslink-spectrum-matches or crosslinks was provided, a list of target crosslink-spectrum-matches or crosslinks is returned. If a parser_result was provided, a parser_result with target crosslink-spectrum-matches and/or target crosslinks will be returned.
- Return type:
list of dict of str, any, or dict of str, any
- Raises:
TypeError – If a wrong data type is provided.
RuntimeError – If no target crosslinks or crosslink-spectrum-matches were found.
Examples
>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import targets_only
>>> result = read("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_CSMs.xlsx", engine="MS Annika", crosslinker="DSS")
>>> targets = targets_only(result["crosslink-spectrum-matches"])
>>> len(targets)
786

>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import targets_only
>>> result = read("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_Crosslinks.xlsx", engine="MS Annika", crosslinker="DSS")
>>> targets = targets_only(result["crosslinks"])
>>> len(targets)
265

>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import targets_only
>>> result = read(["data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_CSMs.xlsx", "data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_Crosslinks.xlsx"], engine="MS Annika", crosslinker="DSS")
>>> result_targets = targets_only(result)
>>> len(result_targets["crosslink-spectrum-matches"])
786
>>> len(result_targets["crosslinks"])
265
pyXLMS.transform_to_dataframe module#
- pyXLMS.transform_to_dataframe.to_dataframe(
- data: List[Dict[str, Any]],
Returns a pandas DataFrame of the given crosslinks or crosslink-spectrum-matches.
- Parameters:
data (list) – A list of crosslinks or crosslink-spectrum-matches as created by data.create_crosslink() or data.create_csm().
- Returns:
The pandas DataFrame created from the list of input crosslinks or crosslink-spectrum-matches. A full specification of the returned DataFrame can be found in the docs.
- Return type:
pandas.DataFrame
- Raises:
TypeError – If the list does not contain crosslinks or crosslink-spectrum-matches.
ValueError – If the list does not contain any objects.
Examples
>>> from pyXLMS.transform import to_dataframe
>>> # assume that crosslinks is a list of crosslinks created by data.create_crosslink()
>>> crosslink_dataframe = to_dataframe(crosslinks)
>>> # assume csms is a list of crosslink-spectrum-matches created by data.create_csm()
>>> csm_dataframe = to_dataframe(csms)
pyXLMS.transform_to_proforma module#
- pyXLMS.transform_to_proforma.to_proforma(
- data: Dict[str, Any] | List[Dict[str, Any]],
- crosslinker: str | float | None = None,
Returns the Proforma string for a single crosslink or crosslink-spectrum-match, or for a list of crosslinks or crosslink-spectrum-matches.
- Parameters:
data (dict of str, any, or list of dict of str, any) – A pyXLMS crosslink object, e.g. see data.create_crosslink(). Or a pyXLMS crosslink-spectrum-match object, e.g. see data.create_csm(). Alternatively, a list of crosslinks or crosslink-spectrum-matches can also be provided.
crosslinker (str, or float, or None, default = None) – Optional name or mass of the crosslink reagent. If the name is given, it should be a valid name from XLMOD. If the crosslink modification is contained in the crosslink-spectrum-match object this parameter has no effect.
- Returns:
The Proforma string of the crosslink or crosslink-spectrum-match. If a list was provided a list containing all Proforma strings is returned.
- Return type:
str
- Raises:
TypeError – If an unsupported data type is provided.
Notes
Modifications with unknown mass are skipped.
If no modifications are given, only the crosslink modification will be encoded in the Proforma.
If no modifications are given and no crosslinker is given, the unmodified peptide Proforma will be returned.
Examples
>>> from pyXLMS.data import create_crosslink_min
>>> from pyXLMS.transform import to_proforma
>>> xl = create_crosslink_min("PEPKTIDE", 4, "KPEPTIDE", 1)
>>> to_proforma(xl)
'KPEPTIDE//PEPKTIDE'

>>> from pyXLMS.data import create_crosslink_min
>>> from pyXLMS.transform import to_proforma
>>> xl = create_crosslink_min("PEPKTIDE", 4, "KPEPTIDE", 1)
>>> to_proforma(xl, crosslinker="Xlink:DSSO")
'K[Xlink:DSSO]PEPTIDE//PEPK[Xlink:DSSO]TIDE'

>>> from pyXLMS.data import create_csm_min
>>> from pyXLMS.transform import to_proforma
>>> csm = create_csm_min("PEPKTIDE", 4, "KPEPTIDE", 1, "RUN_1", 1)
>>> to_proforma(csm)
'KPEPTIDE//PEPKTIDE'

>>> from pyXLMS.data import create_csm_min
>>> from pyXLMS.transform import to_proforma
>>> csm = create_csm_min("PEPKTIDE", 4, "KPEPTIDE", 1, "RUN_1", 1)
>>> to_proforma(csm, crosslinker="Xlink:DSSO")
'K[Xlink:DSSO]PEPTIDE//PEPK[Xlink:DSSO]TIDE'

>>> from pyXLMS.data import create_csm_min
>>> from pyXLMS.transform import to_proforma
>>> csm = create_csm_min("PEPKTIDE", 4, "KPMEPTIDE", 1, "RUN_1", 1, modifications_b={3:("Oxidation", 15.994915)})
>>> to_proforma(csm, crosslinker="Xlink:DSSO")
'K[Xlink:DSSO]PM[+15.994915]EPTIDE//PEPK[Xlink:DSSO]TIDE'

>>> from pyXLMS.data import create_csm_min
>>> from pyXLMS.transform import to_proforma
>>> csm = create_csm_min("PEPKTIDE", 4, "KPMEPTIDE", 1, "RUN_1", 1, modifications_b={3:("Oxidation", 15.994915)}, charge=3)
>>> to_proforma(csm, crosslinker="Xlink:DSSO")
'K[Xlink:DSSO]PM[+15.994915]EPTIDE//PEPK[Xlink:DSSO]TIDE/3'

>>> from pyXLMS.data import create_csm_min
>>> from pyXLMS.transform import to_proforma
>>> csm = create_csm_min("PEPKTIDE", 4, "KPMEPTIDE", 1, "RUN_1", 1, modifications_a={4:("DSSO", 158.00376)}, modifications_b={1:("DSSO", 158.00376), 3:("Oxidation", 15.994915)}, charge=3)
>>> to_proforma(csm)
'K[+158.00376]PM[+15.994915]EPTIDE//PEPK[+158.00376]TIDE/3'

>>> from pyXLMS.data import create_csm_min
>>> from pyXLMS.transform import to_proforma
>>> csm = create_csm_min("PEPKTIDE", 4, "KPMEPTIDE", 1, "RUN_1", 1, modifications_a={4:("DSSO", 158.00376)}, modifications_b={1:("DSSO", 158.00376), 3:("Oxidation", 15.994915)}, charge=3)
>>> to_proforma(csm, crosslinker="Xlink:DSSO")
'K[+158.00376]PM[+15.994915]EPTIDE//PEPK[+158.00376]TIDE/3'
pyXLMS.transform_util module#
- pyXLMS.transform_util.assert_data_type_same(data_list: List[Dict[str, Any]]) bool [source]#
Checks that all data is of the same data type.
Verifies that all elements in the provided list are of the same data type.
- Parameters:
data_list (list of dict of str, any) – A list of dictionaries with the data_type key.
- Returns:
True if all elements are of the same data type, otherwise False.
- Return type:
bool
Examples
>>> from pyXLMS.transform import assert_data_type_same
>>> from pyXLMS import data
>>> data_list = [data.create_crosslink_min("PEPK", 4, "PKEP", 2), data.create_crosslink_min("KPEP", 1, "PEKP", 3)]
>>> assert_data_type_same(data_list)
True

>>> from pyXLMS.transform import assert_data_type_same
>>> from pyXLMS import data
>>> data_list = [data.create_crosslink_min("PEPK", 4, "PKEP", 2), data.create_csm_min("KPEP", 1, "PEKP", 3, "RUN_1", 1)]
>>> assert_data_type_same(data_list)
False
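Conceptually, the check compares the data_type value of every element. A minimal sketch of such a check is shown below; same_data_type is a hypothetical helper rather than the pyXLMS implementation, the data_type strings used in the example are placeholders, and treating an empty list as consistent is an assumption of this sketch.

```python
from typing import Any, Dict, List


def same_data_type(data_list: List[Dict[str, Any]]) -> bool:
    """Return True if all dictionaries share one data_type value."""
    # collect the distinct data_type values; more than one means mixed input
    distinct_types = {item["data_type"] for item in data_list}
    return len(distinct_types) <= 1


# placeholder data_type values, chosen for illustration only
homogeneous = [{"data_type": "crosslink"}, {"data_type": "crosslink"}]
mixed = [{"data_type": "crosslink"}, {"data_type": "crosslink-spectrum-match"}]
print(same_data_type(homogeneous), same_data_type(mixed))
```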
- pyXLMS.transform_util.get_available_keys(
- data_list: List[Dict[str, Any]],
Checks which data is available from a list of crosslinks or crosslink-spectrum-matches.
Verifies which data fields have been set for all crosslinks or crosslink-spectrum-matches in the given list. Returns a dictionary structured like a crosslink or crosslink-spectrum-match, but with each value replaced by True or False, depending on whether the field was set.
- Parameters:
data_list (list of dict of str, any) – A list of crosslinks or crosslink-spectrum-matches.
- Returns:
If a list of crosslinks was provided, a dictionary with the following keys will be returned, where the value of each key denotes if the data field is available for all crosslinks in data_list. Keys: data_type, completeness, alpha_peptide, alpha_peptide_crosslink_position, alpha_proteins, alpha_proteins_crosslink_positions, alpha_decoy, beta_peptide, beta_peptide_crosslink_position, beta_proteins, beta_proteins_crosslink_positions, beta_decoy, crosslink_type, score, and additional_information.
If a list of crosslink-spectrum-matches was provided, a dictionary with the following keys will be returned, where the value of each key denotes if the data field is available for all crosslink-spectrum-matches in data_list. Keys: data_type, completeness, alpha_peptide, alpha_modifications, alpha_peptide_crosslink_position, alpha_proteins, alpha_proteins_crosslink_positions, alpha_proteins_peptide_positions, alpha_score, alpha_decoy, beta_peptide, beta_modifications, beta_peptide_crosslink_position, beta_proteins, beta_proteins_crosslink_positions, beta_proteins_peptide_positions, beta_score, beta_decoy, crosslink_type, score, spectrum_file, scan_nr, retention_time, ion_mobility, and additional_information.
- Return type:
dict of str, bool
- Raises:
TypeError – If not all elements in data_list are of the same data type.
TypeError – If one or more elements in the list are of an unsupported data type.
Examples
>>> from pyXLMS.transform import get_available_keys
>>> from pyXLMS import data
>>> data_list = [data.create_crosslink_min("PEPK", 4, "PKEP", 2), data.create_crosslink_min("KPEP", 1, "PEKP", 3)]
>>> available_keys = get_available_keys(data_list)
>>> available_keys["alpha_peptide"]
True
>>> available_keys["score"]
False
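The idea of "available for all elements" can be sketched with plain dictionaries. This is an illustrative approximation only: the helper name `available_keys_sketch` is hypothetical, and it assumes that an unset field is stored as None, which the text above does not confirm.

```python
from typing import Any, Dict, List


def available_keys_sketch(data_list: List[Dict[str, Any]]) -> Dict[str, bool]:
    """Map each key to True only if it is set (not None) in every element."""
    # Assumption: all elements share the key layout of the first element
    keys = data_list[0].keys()
    return {
        key: all(item.get(key) is not None for item in data_list)
        for key in keys
    }


# Hypothetical crosslink-like dictionaries; the second one has no score
crosslinks = [
    {"data_type": "crosslink", "alpha_peptide": "PEPK", "score": 12.3},
    {"data_type": "crosslink", "alpha_peptide": "KPEP", "score": None},
]

available = available_keys_sketch(crosslinks)
print(available["alpha_peptide"], available["score"])
```

A single missing value is enough to mark a field as unavailable, which mirrors the "available for all" wording of the return description. The sketch does not handle an empty list.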
- pyXLMS.transform_util.modifications_to_str(modifications: Dict[int, Tuple[str, float]] | None) str | None [source]#
Returns the string representation of a modifications dictionary.
- Parameters:
modifications (dict of int, tuple, or None) – The modifications of a peptide given as a dictionary that maps peptide position (1-based) to modification given as a tuple of modification name and modification delta mass. N-terminal modifications should be denoted with position 0. C-terminal modifications should be denoted with position len(peptide) + 1.
- Returns:
The string representation of the modifications (or None if no modification was provided).
- Return type:
str, or None
Examples
>>> from pyXLMS.transform import modifications_to_str
>>> modifications_to_str({1: ("Oxidation", 15.994915), 5: ("Carbamidomethyl", 57.021464)})
'(1:[Oxidation|15.994915]);(5:[Carbamidomethyl|57.021464])'
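The string format shown in the example can be reproduced with a few lines of standalone Python. This is a sketch under stated assumptions, not the library's code: the name `modifications_to_str_sketch` is hypothetical, positions are assumed to be emitted in ascending order, and an empty dictionary is assumed to be treated like None.

```python
from typing import Dict, Optional, Tuple


def modifications_to_str_sketch(
    modifications: Optional[Dict[int, Tuple[str, float]]],
) -> Optional[str]:
    """Format modifications as '(position:[name|delta_mass]);...'."""
    # Assumption: both None and an empty dictionary count as "no modification"
    if not modifications:
        return None
    parts = [
        f"({position}:[{name}|{delta_mass}])"
        for position, (name, delta_mass) in sorted(modifications.items())
    ]
    return ";".join(parts)


result = modifications_to_str_sketch(
    {5: ("Carbamidomethyl", 57.021464), 1: ("Oxidation", 15.994915)}
)
print(result)
```

Sorting by position makes the output independent of the insertion order of the dictionary, so the same peptide always yields the same string.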
pyXLMS.transform_validate module#
- pyXLMS.transform_validate.validate(
- data: List[Dict[str, Any]] | Dict[str, Any],
- fdr: float = 0.01,
- formula: Literal['D/T', '(TD+DD)/TT', '(TD-DD)/TT'] = 'D/T',
- score: Literal['higher_better', 'lower_better'] = 'higher_better',
- separate_intra_inter: bool = False,
- ignore_missing_labels: bool = False,
) List[Dict[str, Any]] | Dict[str, Any] [source]#
Validate a list of crosslinks or crosslink-spectrum-matches, or a parser_result by estimating false discovery rate.
Validate a list of crosslinks or crosslink-spectrum-matches, or a parser_result by estimating false discovery rate (FDR) using the defined formula. Requires that “score”, “alpha_decoy” and “beta_decoy” fields are set for crosslinks and crosslink-spectrum-matches.
- Parameters:
data (list of dict of str, any, or dict of str, any) – A list of crosslink-spectrum-matches or crosslinks to validate, or a parser_result.
fdr (float, default = 0.01) – The target FDR, must be given as a real number between 0 and 1. The default of 0.01 corresponds to 1% FDR.
formula (str, one of "D/T", "(TD+DD)/TT", or "(TD-DD)/TT", default = "D/T") – Which formula to use to estimate FDR. D and DD denote decoy matches, T and TT denote target matches, and TD denotes target-decoy and decoy-target matches.
score (str, one of "higher_better" or "lower_better", default = "higher_better") – Whether a higher or a lower score is considered better.
separate_intra_inter (bool, default = False) – If FDR should be estimated separately for intra and inter matches.
ignore_missing_labels (bool, default = False) – If crosslinks and crosslink-spectrum-matches should be ignored if they don’t have target and decoy labels. By default an error is thrown if any unlabelled data is encountered.
- Returns:
If a list of crosslink-spectrum-matches or crosslinks was provided, a list of validated crosslink-spectrum-matches or crosslinks is returned. If a parser_result was provided, a parser_result with validated crosslink-spectrum-matches and/or validated crosslinks will be returned.
- Return type:
list of dict of str, any, or dict of str, any
- Raises:
TypeError – If a wrong data type is provided.
TypeError – If parameter formula is not one of ‘D/T’, ‘(TD+DD)/TT’, or ‘(TD-DD)/TT’.
TypeError – If parameter score is not one of ‘higher_better’ or ‘lower_better’.
ValueError – If parameter fdr is outside of the supported range.
ValueError – If attribute ‘score’ is not available for any of the data.
ValueError – If attribute ‘alpha_decoy’ or ‘beta_decoy’ is not available for any of the data and parameter ignore_missing_labels is set to False.
ValueError – If the number of DD matches exceeds the number of TD matches for formula ‘(TD-DD)/TT’. FDR cannot be estimated with the formula ‘(TD-DD)/TT’ in these cases.
Notes
Please note that progress bars will usually not complete when running this function. This is by design as it is not necessary to iterate over all scores to estimate FDR.
Examples
>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import validate
>>> pr = read("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_CSMs.xlsx", engine="MS Annika", crosslinker="DSS")
>>> csms = pr["crosslink-spectrum-matches"]
>>> len(csms)
826
>>> validated = validate(csms)
>>> len(validated)
705

>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import validate
>>> pr = read(["data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_CSMs.xlsx", "data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_Crosslinks.xlsx"], engine="MS Annika", crosslinker="DSS")
>>> len(pr["crosslink-spectrum-matches"])
826
>>> len(pr["crosslinks"])
300
>>> validated = validate(pr)
>>> len(validated["crosslink-spectrum-matches"])
705
>>> len(validated["crosslinks"])
226

>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import validate
>>> pr = read(["data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_CSMs.xlsx", "data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_Crosslinks.xlsx"], engine="MS Annika", crosslinker="DSS")
>>> len(pr["crosslink-spectrum-matches"])
826
>>> len(pr["crosslinks"])
300
>>> validated = validate(pr, fdr=0.05)
>>> len(validated["crosslink-spectrum-matches"])
825
>>> len(validated["crosslinks"])
260
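To make the FDR formulas more concrete, here is a minimal standalone sketch of score-based filtering with the ‘(TD+DD)/TT’ and ‘(TD-DD)/TT’ formulas. It is not pyXLMS code. The names `estimate_fdr` and `validate_sketch` are hypothetical, higher scores are assumed to be better, and ‘D/T’ is left out because the mapping of D and T onto the two decoy labels is not spelled out above.

```python
from typing import Any, Dict, List


def estimate_fdr(matches: List[Dict[str, Any]], formula: str = "(TD+DD)/TT") -> float:
    """Estimate FDR from target/decoy labels (illustrative, not pyXLMS code)."""
    # TD: exactly one peptide is a decoy; DD: both are decoys; TT: both are targets
    td = sum(1 for m in matches if m["alpha_decoy"] != m["beta_decoy"])
    dd = sum(1 for m in matches if m["alpha_decoy"] and m["beta_decoy"])
    tt = len(matches) - td - dd
    if formula == "(TD+DD)/TT":
        decoys = td + dd
    elif formula == "(TD-DD)/TT":
        if dd > td:
            raise ValueError("DD exceeds TD, FDR cannot be estimated")
        decoys = td - dd
    else:
        raise ValueError(f"Unsupported formula: {formula}")
    if tt == 0:
        # No targets left: treat the estimate as unbounded unless there are no decoys
        return float("inf") if decoys > 0 else 0.0
    return decoys / tt


def validate_sketch(
    matches: List[Dict[str, Any]], fdr: float = 0.01, formula: str = "(TD+DD)/TT"
) -> List[Dict[str, Any]]:
    """Keep the largest top-scoring subset whose estimated FDR is within the target."""
    # Best scores first; sorted() returns a new list, so the input is not modified
    ranked = sorted(matches, key=lambda m: m["score"], reverse=True)
    # Drop the worst-scoring match until the estimate meets the target
    while ranked and estimate_fdr(ranked, formula) > fdr:
        ranked.pop()
    return ranked


# Hypothetical matches: (score, alpha_decoy, beta_decoy)
labels = [(10, False, False), (9, False, False), (8, False, False),
          (7, True, False), (6, False, False), (5, True, True)]
matches = [{"score": s, "alpha_decoy": a, "beta_decoy": b} for s, a, b in labels]

validated = validate_sketch(matches, fdr=0.25)
print(len(validated))
```

With a target of 0.25 only the lowest-scoring DD match is removed, since the remaining set has one TD and four TT matches. The loop stops as soon as the target is met, so not every score has to be visited. The sketch keeps decoys that pass the cutoff and ignores score ties.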