pyXLMS package#

Submodules#

pyXLMS.constants module#

pyXLMS.constants.AMINO_ACIDS = {'A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'Y'}#

Set of valid amino acids.

Set of one-letter codes for all valid amino acids.

Examples

>>> from pyXLMS.constants import AMINO_ACIDS
>>> "A" in AMINO_ACIDS
True
>>> "B" in AMINO_ACIDS
False
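
Because AMINO_ACIDS supports membership tests, it can be used to validate a peptide sequence before it is passed on. The following is a minimal sketch: the set is copied from the value documented above so that the snippet runs without pyXLMS installed, and the helper name invalid_residues is hypothetical and not part of pyXLMS.

```python
# Copied from the documented value of pyXLMS.constants.AMINO_ACIDS.
AMINO_ACIDS = set("ACDEFGHIJKLMNOPQRSTUVWY")


def invalid_residues(peptide: str) -> list[tuple[int, str]]:
    """Return (1-based position, residue) for every residue not in AMINO_ACIDS.

    Hypothetical helper for illustration, not a pyXLMS function.
    """
    return [
        (position, residue)
        for position, residue in enumerate(peptide, start=1)
        if residue not in AMINO_ACIDS
    ]


print(invalid_residues("PEPTIDE"))   # []
print(invalid_residues("PEPBTIXE"))  # [(4, 'B'), (7, 'X')]
```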
pyXLMS.constants.AMINO_ACIDS_1TO3 = {'A': 'ALA', 'C': 'CYS', 'D': 'ASP', 'E': 'GLU', 'F': 'PHE', 'G': 'GLY', 'H': 'HIS', 'I': 'ILE', 'K': 'LYS', 'L': 'LEU', 'M': 'MET', 'N': 'ASN', 'P': 'PRO', 'Q': 'GLN', 'R': 'ARG', 'S': 'SER', 'T': 'THR', 'V': 'VAL', 'W': 'TRP', 'Y': 'TYR'}#

Mapping of amino acid 1-letter codes to their 3-letter codes.

Mapping of all amino acid 1-letter codes to their corresponding 3-letter codes.

Examples

>>> from pyXLMS.constants import AMINO_ACIDS_1TO3
>>> AMINO_ACIDS_1TO3["G"]
'GLY'
pyXLMS.constants.AMINO_ACIDS_3TO1 = {'ALA': 'A', 'ARG': 'R', 'ASN': 'N', 'ASP': 'D', 'CYS': 'C', 'GLN': 'Q', 'GLU': 'E', 'GLY': 'G', 'HIS': 'H', 'ILE': 'I', 'LEU': 'L', 'LYS': 'K', 'MET': 'M', 'PHE': 'F', 'PRO': 'P', 'SER': 'S', 'THR': 'T', 'TRP': 'W', 'TYR': 'Y', 'VAL': 'V'}#

Mapping of amino acid 3-letter codes to their 1-letter codes.

Mapping of all amino acid 3-letter codes to their corresponding 1-letter codes.

Examples

>>> from pyXLMS.constants import AMINO_ACIDS_3TO1
>>> AMINO_ACIDS_3TO1["GLY"]
'G'
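
The two mappings are inverses of each other, so a peptide can be translated to three-letter codes and back. This is a minimal sketch with the dictionary copied from the documented value; the function names to_three_letter and to_one_letter are hypothetical and not part of pyXLMS.

```python
# Copied from the documented value of pyXLMS.constants.AMINO_ACIDS_1TO3.
AMINO_ACIDS_1TO3 = {
    "A": "ALA", "C": "CYS", "D": "ASP", "E": "GLU", "F": "PHE",
    "G": "GLY", "H": "HIS", "I": "ILE", "K": "LYS", "L": "LEU",
    "M": "MET", "N": "ASN", "P": "PRO", "Q": "GLN", "R": "ARG",
    "S": "SER", "T": "THR", "V": "VAL", "W": "TRP", "Y": "TYR",
}
# Inverting the mapping reproduces the documented AMINO_ACIDS_3TO1.
AMINO_ACIDS_3TO1 = {three: one for one, three in AMINO_ACIDS_1TO3.items()}


def to_three_letter(peptide: str) -> list[str]:
    """Translate a one-letter peptide sequence to a list of three-letter codes."""
    return [AMINO_ACIDS_1TO3[residue] for residue in peptide]


def to_one_letter(residues: list[str]) -> str:
    """Translate a list of three-letter codes back to a one-letter sequence."""
    return "".join(AMINO_ACIDS_3TO1[residue] for residue in residues)


three = to_three_letter("PEPK")
print(three)                 # ['PRO', 'GLU', 'PRO', 'LYS']
print(to_one_letter(three))  # PEPK
```

Note that “J”, “O” and “U” are in AMINO_ACIDS but have no entry in these mappings, so the sketch raises a KeyError for them.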
pyXLMS.constants.CROSSLINKERS = {'ADH': 138.09054635, 'BS3': 138.06808, 'DSBSO': 308.03883, 'DSBU': 196.08479231, 'DSS': 138.06808, 'DSSO': 158.00376, 'PhoX': 209.97181}#

Dictionary of crosslinkers.

Dictionary of pre-defined crosslinkers that maps crosslinker names to crosslinker delta masses. Currently contains “BS3”, “DSS”, “DSSO”, “DSBU”, “ADH”, “DSBSO” and “PhoX”.

Examples

>>> from pyXLMS.constants import CROSSLINKERS
>>> CROSSLINKERS["BS3"]
138.06808
pyXLMS.constants.MODIFICATIONS = {'ADH': 138.09054635, 'Acetyl': 42.010565, 'BS3': 138.06808, 'Carbamidomethyl': 57.021464, 'DSBSO': 308.03883, 'DSBU': 196.08479231, 'DSS': 138.06808, 'DSSO': 158.00376, 'Oxidation': 15.994915, 'PhoX': 209.97181, 'Phospho': 79.966331}#

Dictionary of post-translational modifications.

Dictionary of pre-defined post-translational modifications that maps modification names to modification delta masses. Currently contains “Carbamidomethyl”, “Oxidation”, “Phospho”, “Acetyl” and all crosslinkers.

Examples

>>> from pyXLMS.constants import MODIFICATIONS
>>> MODIFICATIONS["Carbamidomethyl"]
57.021464
>>> MODIFICATIONS["BS3"]
138.06808
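
Since MODIFICATIONS contains the four listed modifications plus all crosslinkers, its documented value can be reproduced by merging two dictionaries. The sketch below does that with values copied from this page and adds a reverse lookup by delta mass. The names PTMS and lookup_by_mass are hypothetical and not part of pyXLMS, and whether pyXLMS builds the dictionary this way is an assumption.

```python
# Copied from the documented value of pyXLMS.constants.CROSSLINKERS.
CROSSLINKERS = {
    "ADH": 138.09054635, "BS3": 138.06808, "DSBSO": 308.03883,
    "DSBU": 196.08479231, "DSS": 138.06808, "DSSO": 158.00376,
    "PhoX": 209.97181,
}
# Hypothetical name for the non-crosslinker entries of MODIFICATIONS.
PTMS = {
    "Acetyl": 42.010565, "Carbamidomethyl": 57.021464,
    "Oxidation": 15.994915, "Phospho": 79.966331,
}
# Merging both reproduces the documented value of MODIFICATIONS.
MODIFICATIONS = {**PTMS, **CROSSLINKERS}


def lookup_by_mass(mass: float, tolerance: float = 0.001) -> list[str]:
    """Return the sorted names of all modifications within tolerance of mass."""
    return sorted(
        name
        for name, delta in MODIFICATIONS.items()
        if abs(delta - mass) <= tolerance
    )


print(lookup_by_mass(57.0215))  # ['Carbamidomethyl']
print(lookup_by_mass(138.068))  # ['BS3', 'DSS']
```

The second lookup returns two names because BS3 and DSS share the same delta mass, so a mass alone does not identify the crosslinker.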
pyXLMS.constants.SCOUT_MODIFICATION_MAPPING = {'+15.994900': ('Oxidation', 15.994915), '+57.021460': ('Carbamidomethyl', 57.021464), 'ADH': ('ADH', 138.09054635), 'BS3': ('BS3', 138.06808), 'Carbamidomethyl': ('Carbamidomethyl', 57.021464), 'DSBSO': ('DSBSO', 308.03883), 'DSBU': ('DSBU', 196.08479231), 'DSS': ('DSS', 138.06808), 'DSSO': ('DSSO', 158.00376), 'Oxidation of Methionine': ('Oxidation', 15.994915), 'PhoX': ('PhoX', 209.97181)}#

Dictionary that maps sequence elements and modifications from Scout to their corresponding post-translational modifications.

Dictionary that maps sequence elements (e.g. “+57.021460”) and modifications (e.g. “Carbamidomethyl”) from Scout to their corresponding post-translational modifications (e.g. (“Carbamidomethyl”, 57.021464)).

Examples

>>> from pyXLMS.constants import SCOUT_MODIFICATION_MAPPING
>>> SCOUT_MODIFICATION_MAPPING["+57.021460"]
('Carbamidomethyl', 57.021464)
>>> SCOUT_MODIFICATION_MAPPING["Carbamidomethyl"]
('Carbamidomethyl', 57.021464)
>>> SCOUT_MODIFICATION_MAPPING["Oxidation of Methionine"]
('Oxidation', 15.994915)
pyXLMS.constants.XI_MODIFICATION_MAPPING = {'->': ('Substitution', nan), 'bs3_ami': ('BS3 Amidated', 155.094619105), 'bs3_hyd': ('BS3 Hydrolized', 156.0786347), 'bs3_tris': ('BS3 Tris', 259.141973), 'bs3loop': ('BS3 Looplink', 138.06808), 'bs3nh2': ('BS3 Amidated', 155.094619105), 'bs3oh': ('BS3 Hydrolized', 156.0786347), 'cm': ('Carbamidomethyl', 57.021464), 'dsbu_ami': ('DSBU Amidated', 213.111341), 'dsbu_hyd': ('DSBU Hydrolized', 214.095357), 'dsbu_loop': ('DSBU Looplink', 196.08479231), 'dsbu_tris': ('DSBU Tris', 317.158685), 'dsbuloop': ('DSBU Looplink', 196.08479231), 'dsso_ami': ('DSSO Amidated', 175.030313905), 'dsso_hyd': ('DSSO Hydrolized', 176.0143295), 'dsso_loop': ('DSSO Looplink', 158.00376), 'dsso_tris': ('DSSO Tris', 279.077658), 'dssoloop': ('DSSO Looplink', 158.00376), 'ox': ('Oxidation', 15.994915)}#

Dictionary that maps sequence elements from xiSearch and xiFDR to their corresponding post-translational modifications.

Dictionary that maps sequence elements (e.g. “cm”) from xiSearch and xiFDR to their corresponding post-translational modifications (e.g. (“Carbamidomethyl”, 57.021464)).

Examples

>>> from pyXLMS.constants import XI_MODIFICATION_MAPPING
>>> XI_MODIFICATION_MAPPING["cm"]
('Carbamidomethyl', 57.021464)
>>> XI_MODIFICATION_MAPPING["ox"]
('Oxidation', 15.994915)
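
A mapping like this can be used to turn an annotated peptide string into an unmodified sequence plus a modifications dictionary. The sketch below assumes that each modification is written as a lowercase token directly after the residue it modifies (e.g. “Ccm”). That input format, the two-entry excerpt of the mapping, and the function name parse_xi_sequence are assumptions made for illustration and do not describe how the pyXLMS parser is implemented.

```python
import re

# Two-entry excerpt of the documented XI_MODIFICATION_MAPPING.
XI_MODIFICATION_MAPPING = {
    "cm": ("Carbamidomethyl", 57.021464),
    "ox": ("Oxidation", 15.994915),
}

# One uppercase residue followed by an optional lowercase modification token.
TOKEN = re.compile(r"([A-Z])([a-z0-9_]*)")


def parse_xi_sequence(sequence: str) -> tuple[str, dict[int, tuple[str, float]]]:
    """Split an annotated sequence into (peptide, {1-based position: modification}).

    Hypothetical helper; assumes modification tokens follow their residue.
    """
    residues: list[str] = []
    modifications: dict[int, tuple[str, float]] = {}
    for residue, token in TOKEN.findall(sequence):
        residues.append(residue)
        if token:
            # The position is 1-based and refers to the residue just appended.
            modifications[len(residues)] = XI_MODIFICATION_MAPPING[token]
    return "".join(residues), modifications


peptide, mods = parse_xi_sequence("MoxKCcmPEPTIDE")
print(peptide)  # MKCPEPTIDE
print(mods)     # {1: ('Oxidation', 15.994915), 3: ('Carbamidomethyl', 57.021464)}
```

The resulting dictionary maps 1-based peptide positions to (name, delta mass) tuples, which is the shape documented for the modifications_a and modifications_b parameters of data.create_csm().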

pyXLMS.data module#

pyXLMS.data.check_indexing(value: int | List[int]) bool[source]#

Checks that the given value is not 0-based.

Parameters:

value (int, or list of int) – The value(s) to check.

Returns:

If the given value(s) is/are okay.

Return type:

bool

Raises:

ValueError – If any of the values are smaller than one.

Examples

>>> from pyXLMS.data import check_indexing
>>> check_indexing([1, 2, 3])
True
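
The documented behaviour can be sketched in a few lines. This is an illustrative re-implementation based only on the description above, not the pyXLMS source; the name check_indexing_sketch and the error message are hypothetical.

```python
from typing import List, Union


def check_indexing_sketch(value: Union[int, List[int]]) -> bool:
    """Return True if all given positions are 1-based, otherwise raise ValueError."""
    values = value if isinstance(value, list) else [value]
    for position in values:
        if position < 1:
            raise ValueError(f"Positions are 1-based, but got {position}.")
    return True


print(check_indexing_sketch([1, 2, 3]))  # True
try:
    check_indexing_sketch([0, 1])
except ValueError as error:
    print(error)  # Positions are 1-based, but got 0.
```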
pyXLMS.data.check_input(
parameter: Any,
parameter_name: str,
supported_class: Any,
supported_subclass: Any | None = None,
) bool[source]#

Checks if the given parameter is of the specified type.

Function that checks that a given parameter is of the specified type and, if it is iterable, that all of its elements are of the specified element type. This is mainly an input check that catches errors arising from unsupported inputs early.

Parameters:
  • parameter (any) – Parameter to check class of.

  • parameter_name (str) – Name of the parameter.

  • supported_class (any) – Class the parameter has to be of.

  • supported_subclass (any, or None, default = None) – Class of the values in case the parameter is a list or dict.

Returns:

If the given input is okay.

Return type:

bool

Raises:

TypeError – If the parameter is not of the given class.

Examples

>>> from pyXLMS.data import check_input
>>> check_input("PEPTIDE", "peptide_a", str)
True
>>> from pyXLMS.data import check_input
>>> check_input([1, 2], "xl_position_proteins_a", list, int)
True
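
The check described above can be sketched with isinstance. This is an illustrative re-implementation based only on the documented behaviour, not the pyXLMS source; the name check_input_sketch and the error messages are hypothetical, and checking dict values rather than keys follows the parameter description of supported_subclass.

```python
from typing import Any, Optional


def check_input_sketch(
    parameter: Any,
    parameter_name: str,
    supported_class: Any,
    supported_subclass: Optional[Any] = None,
) -> bool:
    """Return True if the parameter (and its elements) have the expected types."""
    if not isinstance(parameter, supported_class):
        raise TypeError(
            f"{parameter_name} must be of type {supported_class.__name__}."
        )
    if supported_subclass is not None:
        # Only lists and dicts have their elements checked in this sketch.
        if isinstance(parameter, dict):
            elements = list(parameter.values())
        elif isinstance(parameter, list):
            elements = parameter
        else:
            elements = []
        for element in elements:
            if not isinstance(element, supported_subclass):
                raise TypeError(
                    f"Elements of {parameter_name} must be of type "
                    f"{supported_subclass.__name__}."
                )
    return True


print(check_input_sketch("PEPTIDE", "peptide_a", str))                      # True
print(check_input_sketch([1, 2], "xl_position_proteins_a", list, int))      # True
```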
pyXLMS.data.check_input_multi(
parameter: Any,
parameter_name: str,
supported_classes: List[Any],
supported_subclass: Any | None = None,
) bool[source]#

Checks if the given parameter is of one of the specified types.

Function that checks that a given parameter is of one of the specified types and, if it is iterable, that all of its elements are of the specified element type. This is mainly an input check that catches errors arising from unsupported inputs early.

Parameters:
  • parameter (any) – Parameter to check class of.

  • parameter_name (str) – Name of the parameter.

  • supported_classes (list of any) – Classes the parameter has to be of.

  • supported_subclass (any, or None, default = None) – Class of the values in case the parameter is a list or dict.

Returns:

If the given input is okay.

Return type:

bool

Raises:

TypeError – If the parameter is not of one of the given classes.

Examples

>>> from pyXLMS.data import check_input_multi
>>> check_input_multi("PEPTIDE", "peptide_a", [str, list])
True

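pyXLMS.data.create_crosslink(
peptide_a: str,
xl_position_peptide_a: int,
proteins_a: List[str] | None,
xl_position_proteins_a: List[int] | None,
decoy_a: bool | None,
peptide_b: str,
xl_position_peptide_b: int,
proteins_b: List[str] | None,
xl_position_proteins_b: List[int] | None,
decoy_b: bool | None,
score: float | None,
additional_information: Dict[str, Any] | None = None,
) Dict[str, Any]

Creates a crosslink data structure.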

Contains minimal data necessary for representing a single crosslink. The returned crosslink data structure is a dictionary with keys as detailed in the return section.

Parameters:
  • peptide_a (str) – The unmodified amino acid sequence of the first peptide.

  • xl_position_peptide_a (int) – The position of the crosslinker in the sequence of the first peptide (1-based).

  • proteins_a (list of str, or None) – The accessions of proteins that the first peptide is associated with.

  • xl_position_proteins_a (list of int, or None) – Positions of the crosslink in the proteins of the first peptide (1-based).

  • decoy_a (bool, or None) – Whether the alpha peptide is from the decoy database or not.

  • peptide_b (str) – The unmodified amino acid sequence of the second peptide.

  • xl_position_peptide_b (int) – The position of the crosslinker in the sequence of the second peptide (1-based).

  • proteins_b (list of str, or None) – The accessions of proteins that the second peptide is associated with.

  • xl_position_proteins_b (list of int, or None) – Positions of the crosslink in the proteins of the second peptide (1-based).

  • decoy_b (bool, or None) – Whether the beta peptide is from the decoy database or not.

  • score (float, or None) – Score of the crosslink.

  • additional_information (dict with str keys, or None, default = None) – A dictionary with additional information associated with the crosslink.

Returns:

The dictionary representing the crosslink with keys data_type, completeness, alpha_peptide, alpha_peptide_crosslink_position, alpha_proteins, alpha_proteins_crosslink_positions, alpha_decoy, beta_peptide, beta_peptide_crosslink_position, beta_proteins, beta_proteins_crosslink_positions, beta_decoy, crosslink_type, score, and additional_information. Alpha and beta are assigned based on peptide sequence: the peptide that comes first alphabetically is assigned to alpha.

Return type:

dict

Raises:
  • TypeError – If the parameter is not of the given class.

  • ValueError – If the length of crosslink positions is not equal to the length of proteins.

Notes

The minimum required data for creating a crosslink is:

  • peptide_a: The unmodified amino acid sequence of the first peptide.

  • peptide_b: The unmodified amino acid sequence of the second peptide.

  • xl_position_peptide_a: The position of the crosslinker in the sequence of the first peptide (1-based).

  • xl_position_peptide_b: The position of the crosslinker in the sequence of the second peptide (1-based).

Examples

>>> from pyXLMS.data import create_crosslink
>>> minimal_crosslink = create_crosslink("PEPTIDEA", 1, None, None, None, "PEPTIDEB", 5, None, None, None, None)
>>> crosslink = create_crosslink("PEPTIDEA", 1, ["PROTEINA"], [1], False, "PEPTIDEB", 5, ["PROTEINB"], [3], False, 34.5)
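
The alpha/beta assignment described in the return section can be sketched as a sort on the peptide sequence. This is an illustration of the documented rule only: the function name assign_alpha_beta is hypothetical, only four of the documented keys are filled in, and the behaviour for two identical sequences (input order is kept here) is an assumption.

```python
from typing import Any, Dict


def assign_alpha_beta(
    peptide_a: str,
    xl_position_peptide_a: int,
    peptide_b: str,
    xl_position_peptide_b: int,
) -> Dict[str, Any]:
    """Assign the alphabetically first peptide to alpha and the other to beta."""
    first = (peptide_a, xl_position_peptide_a)
    second = (peptide_b, xl_position_peptide_b)
    # sorted() is stable, so identical sequences keep their input order.
    alpha, beta = sorted([first, second], key=lambda pair: pair[0])
    return {
        "alpha_peptide": alpha[0],
        "alpha_peptide_crosslink_position": alpha[1],
        "beta_peptide": beta[0],
        "beta_peptide_crosslink_position": beta[1],
    }


# The peptides are passed in reverse alphabetical order on purpose.
result = assign_alpha_beta("PEPTIDEB", 5, "PEPTIDEA", 1)
print(result["alpha_peptide"], result["alpha_peptide_crosslink_position"])  # PEPTIDEA 1
print(result["beta_peptide"], result["beta_peptide_crosslink_position"])    # PEPTIDEB 5
```

The practical consequence is that peptide_a does not necessarily end up as alpha_peptide, so downstream code should read the alpha and beta keys instead of relying on argument order.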

pyXLMS.data.create_crosslink_from_csm(csm: Dict[str, Any]) Dict[str, Any]

Creates a crosslink data structure from a crosslink-spectrum-match.

Creates a crosslink data structure from a crosslink-spectrum-match. The returned crosslink data structure is a dictionary with keys as detailed in the return section.

Parameters:

csm (dict of str) – The crosslink-spectrum-match item to be converted to a crosslink item.

Returns:

The dictionary representing the crosslink with keys data_type, completeness, alpha_peptide, alpha_peptide_crosslink_position, alpha_proteins, alpha_proteins_crosslink_positions, alpha_decoy, beta_peptide, beta_peptide_crosslink_position, beta_proteins, beta_proteins_crosslink_positions, beta_decoy, crosslink_type, score, and additional_information. Alpha and beta are assigned based on peptide sequence: the peptide that comes first alphabetically is assigned to alpha.

Return type:

dict

Raises:

TypeError – If parameter csm is not a valid crosslink-spectrum-match.

Notes

See also data.create_crosslink().

Examples

>>> from pyXLMS.data import create_csm_min, create_crosslink_from_csm
>>> csm = create_csm_min("PEPTIDEA", 1, "PEPTIDEB", 5, "RUN_1", 1)
>>> crosslink = create_crosslink_from_csm(csm)

pyXLMS.data.create_crosslink_min(
peptide_a: str,
xl_position_peptide_a: int,
peptide_b: str,
xl_position_peptide_b: int,
**kwargs,
) Dict[str, Any]

Creates a crosslink data structure from minimal input.

Contains minimal data necessary for representing a single crosslink. This is an alias for data.create_crosslink() that sets all optional parameters to None for convenience. The returned crosslink data structure is a dictionary with keys as detailed in the return section.

Parameters:
  • peptide_a (str) – The unmodified amino acid sequence of the first peptide.

  • xl_position_peptide_a (int) – The position of the crosslinker in the sequence of the first peptide (1-based).

  • peptide_b (str) – The unmodified amino acid sequence of the second peptide.

  • xl_position_peptide_b (int) – The position of the crosslinker in the sequence of the second peptide (1-based).

  • **kwargs – Any additional parameters will be passed to data.create_crosslink().

Returns:

The dictionary representing the crosslink with keys data_type, completeness, alpha_peptide, alpha_peptide_crosslink_position, alpha_proteins, alpha_proteins_crosslink_positions, alpha_decoy, beta_peptide, beta_peptide_crosslink_position, beta_proteins, beta_proteins_crosslink_positions, beta_decoy, crosslink_type, score, and additional_information. Alpha and beta are assigned based on peptide sequence: the peptide that comes first alphabetically is assigned to alpha.

Return type:

dict

Notes

See also data.create_crosslink().

Examples

>>> from pyXLMS.data import create_crosslink_min
>>> minimal_crosslink = create_crosslink_min("PEPTIDEA", 1, "PEPTIDEB", 5)
pyXLMS.data.create_csm(
peptide_a: str,
modifications_a: Dict[int, Tuple[str, float]] | None,
xl_position_peptide_a: int,
proteins_a: List[str] | None,
xl_position_proteins_a: List[int] | None,
pep_position_proteins_a: List[int] | None,
score_a: float | None,
decoy_a: bool | None,
peptide_b: str,
modifications_b: Dict[int, Tuple[str, float]] | None,
xl_position_peptide_b: int,
proteins_b: List[str] | None,
xl_position_proteins_b: List[int] | None,
pep_position_proteins_b: List[int] | None,
score_b: float | None,
decoy_b: bool | None,
score: float | None,
spectrum_file: str,
scan_nr: int,
charge: int | None,
rt: float | None,
im_cv: float | None,
additional_information: Dict[str, Any] | None = None,
) Dict[str, Any][source]#

Creates a crosslink-spectrum-match data structure.

Contains minimal data necessary for representing a single crosslink-spectrum-match. The returned crosslink-spectrum-match data structure is a dictionary with keys as detailed in the return section.

Parameters:
  • peptide_a (str) – The unmodified amino acid sequence of the first peptide.

  • modifications_a (dict of [int, tuple], or None) – The modifications of the first peptide given as a dictionary that maps peptide position (1-based) to modification given as a tuple of modification name and modification delta mass. N-terminal modifications should be denoted with position 0. C-terminal modifications should be denoted with position len(peptide) + 1. If the peptide is not modified an empty dictionary should be given.

  • xl_position_peptide_a (int) – The position of the crosslinker in the sequence of the first peptide (1-based).

  • proteins_a (list of str, or None) – The accessions of proteins that the first peptide is associated with.

  • xl_position_proteins_a (list of int, or None) – Positions of the crosslink in the proteins of the first peptide (1-based).

  • pep_position_proteins_a (list of int, or None) – Positions of the first peptide in the corresponding proteins (1-based).

  • score_a (float, or None) – Identification score of the first peptide.

  • decoy_a (bool, or None) – Whether the alpha peptide is from the decoy database or not.

  • peptide_b (str) – The unmodified amino acid sequence of the second peptide.

  • modifications_b (dict of [int, tuple], or None) – The modifications of the second peptide given as a dictionary that maps peptide position (1-based) to modification given as a tuple of modification name and modification delta mass. N-terminal modifications should be denoted with position 0. C-terminal modifications should be denoted with position len(peptide) + 1. If the peptide is not modified an empty dictionary should be given.

  • xl_position_peptide_b (int) – The position of the crosslinker in the sequence of the second peptide (1-based).

  • proteins_b (list of str, or None) – The accessions of proteins that the second peptide is associated with.

  • xl_position_proteins_b (list of int, or None) – Positions of the crosslink in the proteins of the second peptide (1-based).

  • pep_position_proteins_b (list of int, or None) – Positions of the second peptide in the corresponding proteins (1-based).

  • score_b (float, or None) – Identification score of the second peptide.

  • decoy_b (bool, or None) – Whether the beta peptide is from the decoy database or not.

  • score (float, or None) – Score of the crosslink-spectrum-match.

  • spectrum_file (str) – Name of the spectrum file the crosslink-spectrum-match was identified in.

  • scan_nr (int) – The corresponding scan number of the crosslink-spectrum-match.

  • charge (int, or None) – The precursor charge of the corresponding mass spectrum of the crosslink-spectrum-match.

  • rt (float, or None) – The retention time of the corresponding mass spectrum of the crosslink-spectrum-match in seconds.

  • im_cv (float, or None) – The ion mobility or compensation voltage of the corresponding mass spectrum of the crosslink-spectrum-match.

  • additional_information (dict with str keys, or None, default = None) – A dictionary with additional information associated with the crosslink-spectrum-match.

Returns:

The dictionary representing the crosslink-spectrum-match with keys data_type, completeness, alpha_peptide, alpha_modifications, alpha_peptide_crosslink_position, alpha_proteins, alpha_proteins_crosslink_positions, alpha_proteins_peptide_positions, alpha_score, alpha_decoy, beta_peptide, beta_modifications, beta_peptide_crosslink_position, beta_proteins, beta_proteins_crosslink_positions, beta_proteins_peptide_positions, beta_score, beta_decoy, crosslink_type, score, spectrum_file, scan_nr, retention_time, ion_mobility, and additional_information. Alpha and beta are assigned based on peptide sequence: the peptide that comes first alphabetically is assigned to alpha.

Return type:

dict

Raises:
  • TypeError – If the parameter is not of the given class.

  • ValueError – If the length of crosslink positions or peptide positions is not equal to the length of proteins.

Notes

The minimum required data for creating a crosslink-spectrum-match is:

  • peptide_a: The unmodified amino acid sequence of the first peptide.

  • peptide_b: The unmodified amino acid sequence of the second peptide.

  • xl_position_peptide_a: The position of the crosslinker in the sequence of the first peptide (1-based).

  • xl_position_peptide_b: The position of the crosslinker in the sequence of the second peptide (1-based).

  • spectrum_file: Name of the spectrum file the crosslink-spectrum-match was identified in.

  • scan_nr: The corresponding scan number of the crosslink-spectrum-match.

Examples

>>> from pyXLMS.data import create_csm
>>> minimal_csm = create_csm("PEPTIDEA", {}, 1, None, None, None, None, None, "PEPTIDEB", {}, 5, None, None, None, None, None, None, "MS_EXP1", 1, None, None, None)
>>> csm = create_csm("PEPTIDEA", {1: ("Oxidation", 15.994915)}, 1, ["PROTEINA"], [1], [1], 20.1, False, "PEPTIDEB", {}, 5, ["PROTEINB"], [3], [1], 33.7, False, 20.1, "MS_EXP1", 1, 3, 13.5, -50)
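
The position convention for modifications_a and modifications_b (0 for the N-terminus, 1 to len(peptide) for residues, len(peptide) + 1 for the C-terminus) can be made concrete with a small helper. This is a sketch for illustration; the function name describe_modifications and the output format are hypothetical and not part of pyXLMS.

```python
from typing import Dict, List, Tuple

Modifications = Dict[int, Tuple[str, float]]


def describe_modifications(peptide: str, modifications: Modifications) -> List[str]:
    """Describe each modification by the site that its position refers to."""
    descriptions = []
    for position in sorted(modifications):
        name, delta_mass = modifications[position]
        if position == 0:
            site = "N-term"
        elif position == len(peptide) + 1:
            site = "C-term"
        elif 1 <= position <= len(peptide):
            # Positions are 1-based, Python string indices are 0-based.
            site = f"{peptide[position - 1]}{position}"
        else:
            raise ValueError(f"Position {position} is outside of {peptide}.")
        descriptions.append(f"{site}: {name} ({delta_mass:+.6f})")
    return descriptions


mods = {0: ("Acetyl", 42.010565), 1: ("Oxidation", 15.994915)}
for line in describe_modifications("MPEPTIDE", mods):
    print(line)
# N-term: Acetyl (+42.010565)
# M1: Oxidation (+15.994915)
```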
pyXLMS.data.create_csm_min(
peptide_a: str,
xl_position_peptide_a: int,
peptide_b: str,
xl_position_peptide_b: int,
spectrum_file: str,
scan_nr: int,
**kwargs,
) Dict[str, Any][source]#

Creates a crosslink-spectrum-match data structure from minimal input.

Contains minimal data necessary for representing a single crosslink-spectrum-match. This is an alias for data.create_csm() that sets all optional parameters to None for convenience. The returned crosslink-spectrum-match data structure is a dictionary with keys as detailed in the return section.

Parameters:
  • peptide_a (str) – The unmodified amino acid sequence of the first peptide.

  • xl_position_peptide_a (int) – The position of the crosslinker in the sequence of the first peptide (1-based).

  • peptide_b (str) – The unmodified amino acid sequence of the second peptide.

  • xl_position_peptide_b (int) – The position of the crosslinker in the sequence of the second peptide (1-based).

  • spectrum_file (str) – Name of the spectrum file the crosslink-spectrum-match was identified in.

  • scan_nr (int) – The corresponding scan number of the crosslink-spectrum-match.

  • **kwargs – Any additional parameters will be passed to data.create_csm().

Returns:

The dictionary representing the crosslink-spectrum-match with keys data_type, completeness, alpha_peptide, alpha_modifications, alpha_peptide_crosslink_position, alpha_proteins, alpha_proteins_crosslink_positions, alpha_proteins_peptide_positions, alpha_score, alpha_decoy, beta_peptide, beta_modifications, beta_peptide_crosslink_position, beta_proteins, beta_proteins_crosslink_positions, beta_proteins_peptide_positions, beta_score, beta_decoy, crosslink_type, score, spectrum_file, scan_nr, retention_time, ion_mobility, and additional_information. Alpha and beta are assigned based on peptide sequence: the peptide that comes first alphabetically is assigned to alpha.

Return type:

dict

Notes

See also data.create_csm().

Examples

>>> from pyXLMS.data import create_csm_min
>>> minimal_csm = create_csm_min("PEPTIDEA", 1, "PEPTIDEB", 5, "MS_EXP1", 1)
pyXLMS.data.create_parser_result(
search_engine: str,
csms: List[Dict[str, Any]] | None,
crosslinks: List[Dict[str, Any]] | None,
) Dict[str, Any][source]#

Creates a parser result data structure.

Contains all necessary data elements that should be contained in a result returned by a crosslink search engine result parser.

Parameters:
  • search_engine (str) – Name of the identifying crosslink search engine.

  • csms (list of dict, or None) – List of crosslink-spectrum-matches as created by data.create_csm().

  • crosslinks (list of dict, or None) – List of crosslinks as created by data.create_crosslink().

Returns:

The parser result data structure which is a dictionary with keys data_type, completeness, search_engine, crosslink-spectrum-matches and crosslinks.

Return type:

dict

Examples

>>> from pyXLMS.data import create_parser_result
>>> result = create_parser_result("MS Annika", None, None)
>>> result["data_type"]
'parser_result'
>>> result["completeness"]
'empty'
>>> result["search_engine"]
'MS Annika'

pyXLMS.exporter module#

pyXLMS.exporter_to_impxfdr module#

pyXLMS.exporter_to_impxfdr.to_impxfdr(
data: List[Dict[str, Any]],
filename: str | None,
targets_only: bool = True,
) DataFrame[source]#

Exports a list of crosslinks or crosslink-spectrum-matches to IMP-X-FDR format.

Exports a list of crosslinks or crosslink-spectrum-matches to IMP-X-FDR format for benchmarking purposes. The tool IMP-X-FDR is available from github.com/vbc-proteomics-org/imp-x-fdr. We recommend using version 1.1.0 and selecting “MS Annika” as the input file format for the exported file. A slightly modified version is available from github.com/hgb-bin-proteomics/MSAnnika_NC_Results; it contains a few bug fixes and was used for the MS Annika 2.0 and MS Annika 3.0 publications. The export requires that the fields alpha_proteins, beta_proteins, alpha_proteins_crosslink_positions and beta_proteins_crosslink_positions are set for all crosslinks and crosslink-spectrum-matches.

Parameters:
  • data (list of dict of str, any) – A list of crosslinks or crosslink-spectrum-matches.

  • filename (str, or None, default = None) – If not None, the exported data will be written to a file with the specified filename. The filename should end in “.xlsx” as the file is exported to Microsoft Excel file format.

  • targets_only (bool, default = True) – Whether or not only target crosslinks or crosslink-spectrum-matches should be exported. For benchmarking purposes this is usually the case. If the crosslinks or crosslink-spectrum-matches do not contain target-decoy labels this should be set to False.

Returns:

A pandas DataFrame containing crosslinks or crosslink-spectrum-matches in IMP-X-FDR format.

Return type:

pd.DataFrame

Raises:
  • TypeError – If a wrong data type is provided.

  • TypeError – If data contains elements of mixed data type.

  • ValueError – If the provided data contains no elements or if none of the data has target-decoy labels and parameter ‘targets_only’ is set to True.

  • RuntimeError – If not all of the required information is present in the input data.

Examples

>>> from pyXLMS.exporter import to_impxfdr
>>> from pyXLMS.parser import read
>>> pr = read("data/xi/1perc_xl_boost_Links_xiFDR2.2.1.csv", engine="xiSearch/xiFDR", crosslinker="DSS")
>>> crosslinks = pr["crosslinks"]
>>> to_impxfdr(crosslinks, filename="crosslinks.xlsx")
    Crosslink Type             Sequence A  Position A Accession A In protein A  ... Position B  Accession B In protein B Best CSM Score  Decoy
0            Intra          VVDELV[K]VMGR           7        Cas9          753  ...          7         Cas9          753         40.679  False
1            Intra  MLASAGELQ[K]GNELALPSK          10        Cas9          753  ...          7         Cas9         1226         40.231  False
2            Intra        MDGTEELLV[K]LNR          10        Cas9          396  ...         10         Cas9          396         39.582  False
3            Intra         MTNFD[K]NLPNEK           6        Cas9          965  ...          2         Cas9          504         35.880  False
4            Intra             DFQFY[K]VR           6        Cas9          978  ...          4         Cas9         1028         35.281  False
..             ...                    ...         ...         ...          ...  ...        ...          ...          ...            ...    ...
220          Intra        LP[K]YSLFELENGR           3        Cas9          866  ...          3         Cas9         1204          9.877  False
221          Intra               D[K]QSGK           2        Cas9          677  ...          2         Cas9          677          9.702  False
222          Intra               AGFI[K]R           5        Cas9          922  ...         11         Cas9          881          9.666  False
223          Intra                E[K]IEK           2        Cas9          443  ...          1         Cas9          562          9.656  False
224          Intra                LS[K]SR           3        Cas9          222  ...          3         Cas9          222          9.619  False
[225 rows x 11 columns]
>>> from pyXLMS.exporter import to_impxfdr
>>> from pyXLMS.parser import read
>>> pr = read("data/xi/1perc_xl_boost_CSM_xiFDR2.2.1.csv", engine="xiSearch/xiFDR", crosslinker="DSS")
>>> csms = pr["crosslink-spectrum-matches"]
>>> to_impxfdr(csms, filename="csms.xlsx")
    Crosslink Type          Sequence A  Position A Accession A In protein A  ... Position B  Accession B In protein B Best CSM Score  Decoy
0            Intra  [K]IECFDSVEISGVEDR           1        Cas9          575  ...          1         Cas9          575         27.268  False
1            Intra       LVDSTD[K]ADLR           7        Cas9          152  ...         11         Cas9          881         26.437  False
2            Intra     GGLSELD[K]AGFIK           8        Cas9          917  ...          8         Cas9          917         26.134  False
3            Intra       LVDSTD[K]ADLR           7        Cas9          152  ...          7         Cas9          152         25.804  False
4            Intra       VVDELV[K]VMGR           7        Cas9          753  ...          7         Cas9          753         24.861  False
..             ...                 ...         ...         ...          ...  ...        ...          ...          ...            ...    ...
406          Intra          [K]GILQTVK           1        Cas9          739  ...          3         Cas9          222          6.977  False
407          Intra          QQLPE[K]YK           6        Cas9          350  ...          6         Cas9          350          6.919  False
408          Intra           ESILP[K]R           6        Cas9         1117  ...          7         Cas9         1035          6.853  False
409          Intra             LS[K]SR           3        Cas9          222  ...          2         Cas9          884          6.809  False
410          Intra     QIT[K]HVAQILDSR           4        Cas9          933  ...          6         Cas9          350          6.808  False
[411 rows x 11 columns]
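
The effect of targets_only, together with the documented ValueError cases, can be sketched on plain dictionaries. This is an illustration under stated assumptions, not the exporter's implementation: it assumes that a target is an item whose alpha_decoy and beta_decoy fields are both False and that items without labels are skipped when targets_only is True. The function name select_targets is hypothetical.

```python
from typing import Any, Dict, List


def select_targets(
    data: List[Dict[str, Any]], targets_only: bool = True
) -> List[Dict[str, Any]]:
    """Return the items to export, mirroring the documented error cases."""
    if not data:
        raise ValueError("The provided data contains no elements.")
    if not targets_only:
        return list(data)
    labelled = [
        item
        for item in data
        if item.get("alpha_decoy") is not None and item.get("beta_decoy") is not None
    ]
    if not labelled:
        raise ValueError(
            "No target-decoy labels found, set targets_only to False instead."
        )
    # Assumption: a target has neither peptide coming from the decoy database.
    return [
        item for item in labelled if not item["alpha_decoy"] and not item["beta_decoy"]
    ]


crosslinks = [
    {"alpha_peptide": "KPEPTIDE", "alpha_decoy": False, "beta_decoy": False},
    {"alpha_peptide": "PEKPTIDE", "alpha_decoy": False, "beta_decoy": True},
]
print(len(select_targets(crosslinks)))                      # 1
print(len(select_targets(crosslinks, targets_only=False)))  # 2
```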

pyXLMS.exporter_to_msannika module#

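pyXLMS.exporter_to_msannika.get_msannika_crosslink_sequence(
peptide: str,
crosslink_position: int,
) str

Returns the crosslinked peptide sequence in MS Annika format.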

Returns the crosslinked peptide sequence in MS Annika format, which is the peptide amino acid sequence with the crosslinked residue in square brackets (see examples).

Parameters:
  • peptide (str) – The (unmodified) amino acid sequence of the peptide.

  • crosslink_position (int) – Position of the crosslinker in the peptide sequence (1-based).

Returns:

The crosslinked peptide sequence in MS Annika format.

Return type:

str

Raises:

ValueError – If the crosslink position is outside the peptide’s length.

Examples

>>> from pyXLMS.exporter import get_msannika_crosslink_sequence
>>> get_msannika_crosslink_sequence("PEPKTIDE", 4)
'PEP[K]TIDE'
>>> from pyXLMS.exporter import get_msannika_crosslink_sequence
>>> get_msannika_crosslink_sequence("KPEPTIDE", 1)
'[K]PEPTIDE'
>>> from pyXLMS.exporter import get_msannika_crosslink_sequence
>>> get_msannika_crosslink_sequence("PEPTIDEK", 8)
'PEPTIDE[K]'
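
The format itself is simple enough to sketch with string slicing. The following is an illustrative re-implementation based on the description and examples above, not the pyXLMS source; the name msannika_sequence_sketch and the error message are hypothetical.

```python
def msannika_sequence_sketch(peptide: str, crosslink_position: int) -> str:
    """Put the crosslinked residue (1-based position) in square brackets."""
    if not 1 <= crosslink_position <= len(peptide):
        raise ValueError(
            f"Crosslink position {crosslink_position} is outside of {peptide}."
        )
    index = crosslink_position - 1  # convert to a 0-based string index
    return f"{peptide[:index]}[{peptide[index]}]{peptide[index + 1:]}"


print(msannika_sequence_sketch("PEPKTIDE", 4))  # PEP[K]TIDE
print(msannika_sequence_sketch("KPEPTIDE", 1))  # [K]PEPTIDE
print(msannika_sequence_sketch("PEPTIDEK", 8))  # PEPTIDE[K]
```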
pyXLMS.exporter_to_msannika.to_msannika(
data: List[Dict[str, Any]],
filename: str | None = None,
format: Literal['csv', 'tsv', 'xlsx'] = 'csv',
) DataFrame[source]#

Exports a list of crosslinks or crosslink-spectrum-matches to MS Annika format.

Exports a list of crosslinks or crosslink-spectrum-matches to MS Annika format. This might be useful for tools that support MS Annika input but are not supported by pyXLMS (yet).

Parameters:
  • data (list of dict of str, any) – A list of crosslinks or crosslink-spectrum-matches.

  • filename (str, or None, default = None) – If not None, the exported data will be written to a file with the specified filename.

  • format (str, one of "csv", "tsv", or "xlsx", default = "csv") – File format of the exported file if filename is not None.

Returns:

A pandas DataFrame containing crosslinks or crosslink-spectrum-matches in MS Annika format.

Return type:

pd.DataFrame

Raises:
  • TypeError – If a wrong data type is provided.

  • TypeError – If data contains elements of mixed data type.

  • TypeError – If parameter format is not one of ‘csv’, ‘tsv’ or ‘xlsx’.

  • ValueError – If the provided data contains no elements.

Warning

The MS Annika exporter does not check whether all necessary information is available for the exported crosslinks or crosslink-spectrum-matches. If a value is not available, it is denoted as a missing value in the dataframe and in the exported file. Please make sure all necessary information is available before using the exported file with another tool! Please also note that modifications are not exported; for downstream analysis of modifications, please refer to transform.to_proforma() or transform.to_dataframe()!

Examples

>>> from pyXLMS.exporter import to_msannika
>>> from pyXLMS.data import create_crosslink_min
>>> xl1 = create_crosslink_min("KPEPTIDE", 1, "PKEPTIDE", 2)
>>> xl2 = create_crosslink_min("PEKPTIDE", 3, "PEPKTIDE", 4)
>>> crosslinks = [xl1, xl2]
>>> to_msannika(crosslinks)
  Crosslink Type  Sequence A  Position A Accession A In protein A  Sequence B  Position B Accession B In protein B Best CSM Score Decoy
0          Inter  [K]PEPTIDE           1        None         None  P[K]EPTIDE           2        None         None           None  None
1          Inter  PE[K]PTIDE           3        None         None  PEP[K]TIDE           4        None         None           None  None
>>> from pyXLMS.exporter import to_msannika
>>> from pyXLMS.data import create_crosslink_min
>>> xl1 = create_crosslink_min("KPEPTIDE", 1, "PKEPTIDE", 2)
>>> xl2 = create_crosslink_min("PEKPTIDE", 3, "PEPKTIDE", 4)
>>> crosslinks = [xl1, xl2]
>>> df = to_msannika(crosslinks, filename = "crosslinks.csv", format = "csv")
>>> from pyXLMS.exporter import to_msannika
>>> from pyXLMS.data import create_csm_min
>>> csm1 = create_csm_min("KPEPTIDE", 1, "PKEPTIDE", 2, "RUN_1", 1)
>>> csm2 = create_csm_min("PEKPTIDE", 3, "PEPKTIDE", 4, "RUN_1", 2)
>>> csms = [csm1, csm2]
>>> to_msannika(csms)
            Sequence Crosslink Type Sequence A  Crosslinker Position A  ... First Scan Charge RT [min] Compensation Voltage
0  KPEPTIDE-PKEPTIDE          Inter   KPEPTIDE                       1  ...          1   None     None                 None
1  PEKPTIDE-PEPKTIDE          Inter   PEKPTIDE                       3  ...          2   None     None                 None
[2 rows x 20 columns]
>>> from pyXLMS.exporter import to_msannika
>>> from pyXLMS.data import create_csm_min
>>> csm1 = create_csm_min("KPEPTIDE", 1, "PKEPTIDE", 2, "RUN_1", 1)
>>> csm2 = create_csm_min("PEKPTIDE", 3, "PEPKTIDE", 4, "RUN_1", 2)
>>> csms = [csm1, csm2]
>>> df = to_msannika(csms, filename = "csms.csv", format = "csv")

pyXLMS.exporter_to_pyxlinkviewer module#

pyXLMS.exporter_to_pyxlinkviewer.to_pyxlinkviewer(
crosslinks: List[Dict[str, Any]],
pdb_file: str | BinaryIO,
gap_open: int | float = -10.0,
gap_extension: int | float = -1.0,
min_sequence_identity: float = 0.8,
allow_site_mismatch: bool = False,
ignore_chains: List[str] = [],
filename_prefix: str | None = None,
) Dict[str, Any][source]#

Exports a list of crosslinks to PyXlinkViewer format.

Exports a list of crosslinks to PyXlinkViewer format for visualization in PyMOL. The tool PyXlinkViewer is available from github.com/BobSchiffrin/PyXlinkViewer. This exporter performs basic local sequence alignment to align crosslinked peptides to a protein structure in PDB format. Gap open and gap extension penalties can be chosen, as well as a threshold for sequence identity that must be satisfied for a match to be reported. Additionally, the alignment is checked to see whether the supposedly crosslinked residue can be modified with a crosslinker in the protein structure. Due to alignment shifts the amino acid at the crosslink position might change, so that a crosslink is reported at a position that is not able to react with the crosslinker. Optionally, these positions can still be reported.

Parameters:
  • crosslinks (list of dict of str, any) – A list of crosslinks.

  • pdb_file (str, or file stream) – The name/path of the PDB file or a file-like object/stream. If a string is provided but no file is found locally, it’s assumed to be an identifier and the file is fetched from the PDB.

  • gap_open (int, or float, default = -10.0) – Gap open penalty for sequence alignment.

  • gap_extension (int, or float, default = -1.0) – Gap extension penalty for sequence alignment.

  • min_sequence_identity (float, default = 0.8) – Minimum sequence identity to consider an aligned crosslinked peptide a match with its corresponding position in the protein structure. Should be given as a fraction between 0 and 1, e.g. the default of 0.8 corresponds to a minimum of 80% sequence identity.

  • allow_site_mismatch (bool, default = False) – Whether the position should still be reported if the crosslink position after alignment is not a reactive amino acid in the protein structure. By default such cases are not reported.

  • ignore_chains (list of str, default = empty list) – A list of chains to ignore in the protein structure.

  • filename_prefix (str, or None, default = None) – If not None, the exported data will be written to files with the specified filename prefix. The full list of written files can be accessed via the returned dictionary.
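The interplay of min_sequence_identity and allow_site_mismatch can be illustrated with a small sketch. This is not the pyXLMS implementation: the helper names sequence_identity and accept_match, the 0-based crosslink_column argument, the use of "-" as gap character and the choice of lysine ("K") as the only reactive residue are all assumptions made for illustration. The sketch takes two already aligned sequences and decides whether a match would be reported.

```python
from typing import FrozenSet


def sequence_identity(aligned_peptide: str, aligned_structure: str) -> float:
    """Fraction of alignment columns where both sequences carry the same residue."""
    if len(aligned_peptide) != len(aligned_structure):
        raise ValueError("Aligned sequences must have the same length.")
    if not aligned_peptide:
        return 0.0
    # A column counts as a match only if the residues agree and are not gaps.
    matches = sum(
        1 for a, b in zip(aligned_peptide, aligned_structure) if a == b and a != "-"
    )
    return matches / len(aligned_peptide)


def accept_match(
    aligned_peptide: str,
    aligned_structure: str,
    crosslink_column: int,  # hypothetical: 0-based column of the crosslinked residue
    reactive_residues: FrozenSet[str] = frozenset("K"),  # assumption: lysine only
    min_sequence_identity: float = 0.8,
    allow_site_mismatch: bool = False,
) -> bool:
    """Decide whether an aligned peptide would be reported as a match."""
    if not 0.0 <= min_sequence_identity <= 1.0:
        raise ValueError("min_sequence_identity must be between 0 and 1.")
    if sequence_identity(aligned_peptide, aligned_structure) < min_sequence_identity:
        return False
    site_is_reactive = aligned_structure[crosslink_column] in reactive_residues
    return site_is_reactive or allow_site_mismatch


# Identity is 7/8 = 0.875, but the structure carries R at the crosslink site.
print(accept_match("PEPKTIDE", "PEPRTIDE", 3))
print(accept_match("PEPKTIDE", "PEPRTIDE", 3, allow_site_mismatch=True))
```

In this sketch the identity threshold is checked first, so allow_site_mismatch only affects peptides that already align well enough.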

Returns:

A dictionary with the following keys:

  • PyXlinkViewer – The formatted text for PyXlinkViewer.

  • PyXlinkViewer DataFrame – The information from PyXlinkViewer, but as a pandas DataFrame.

  • Number of mapped crosslinks – The total number of mapped crosslinks.

  • Mapping – A string that logs how crosslinks were mapped to the protein structure.

  • Parsed PDB sequence – The protein sequence that was parsed from the PDB file.

  • Parsed PDB chains – The parsed chains from the PDB file.

  • Parsed PDB residue numbers – The parsed residue numbers from the PDB file.

  • Exported files – A list of filenames of all files that were written to disk.

Return type:

dict of str, any

Raises:
  • TypeError – If a wrong data type is provided.

  • TypeError – If data contains elements of mixed data type.

  • ValueError – If parameter min_sequence_identity is out of bounds.

  • ValueError – If the provided data contains no elements.

Examples

>>> from pyXLMS.exporter import to_pyxlinkviewer
>>> from pyXLMS.parser import read_custom
>>> pr = read_custom("data/_test/exporter/pyxlinkviewer/unique_links_all_pyxlms.csv")
>>> crosslinks = pr["crosslinks"]
>>> pyxlinkviewer_result = to_pyxlinkviewer(crosslinks, pdb_file="6YHU", filename_prefix="6YHU")
>>> pyxlinkviewer_output_file_str = pyxlinkviewer_result["PyXlinkViewer"]
>>> pyxlinkviewer_dataframe = pyxlinkviewer_result["PyXlinkViewer DataFrame"]
>>> nr_mapped_crosslinks = pyxlinkviewer_result["Number of mapped crosslinks"]
>>> crosslink_mapping = pyxlinkviewer_result["Mapping"]
>>> parsed_pdb_sequence = pyxlinkviewer_result["Parsed PDB sequence"]
>>> parsed_pdb_chains = pyxlinkviewer_result["Parsed PDB chains"]
>>> parsed_pdb_residue_numbers = pyxlinkviewer_result["Parsed PDB residue numbers"]
>>> exported_files = pyxlinkviewer_result["Exported files"]

pyXLMS.exporter_to_xifdr module#

pyXLMS.exporter_to_xifdr.to_xifdr(
csms: List[Dict[str, Any]],
filename: str | None,
) DataFrame[source]#

Exports a list of crosslink-spectrum-matches to xiFDR format.

Exports a list of crosslink-spectrum-matches to xiFDR format. The tool xiFDR is accessible via the link rappsilberlab.org/software/xifdr. Requires that alpha_proteins, beta_proteins, alpha_proteins_peptide_positions, beta_proteins_peptide_positions, alpha_decoy, beta_decoy, charge and score fields are set for all crosslink-spectrum-matches.
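Before exporting, it can be useful to check which crosslink-spectrum-matches lack one of the required fields. The following is a minimal sketch under the assumption that crosslink-spectrum-matches are plain dictionaries in which an unset field is either absent or None; the helper find_incomplete_csms is hypothetical and not part of pyXLMS.

```python
from typing import Any, Dict, List

# Field names as listed in the description above.
XIFDR_REQUIRED_FIELDS = [
    "alpha_proteins",
    "beta_proteins",
    "alpha_proteins_peptide_positions",
    "beta_proteins_peptide_positions",
    "alpha_decoy",
    "beta_decoy",
    "charge",
    "score",
]


def find_incomplete_csms(
    csms: List[Dict[str, Any]],
    required: List[str] = XIFDR_REQUIRED_FIELDS,
) -> Dict[int, List[str]]:
    """Map the index of every incomplete CSM to the names of its unset fields."""
    incomplete = {}
    for index, csm in enumerate(csms):
        # Assumption: an unset field is missing from the dict or set to None.
        missing = [field for field in required if csm.get(field) is None]
        if missing:
            incomplete[index] = missing
    return incomplete


# Toy data for illustration only, not real search engine output.
complete = {field: 1 for field in XIFDR_REQUIRED_FIELDS}
without_score = dict(complete, score=None)
print(find_incomplete_csms([complete, without_score]))
```

An empty result from such a check suggests that the export should not fail because of missing information.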

Parameters:
  • csms (list of dict of str, any) – A list of crosslink-spectrum-matches.

  • filename (str, or None) – If not None, the exported data will be written to a file with the specified filename.

Returns:

A pandas DataFrame containing crosslink-spectrum-matches in xiFDR format.

Return type:

pd.DataFrame

Raises:
  • TypeError – If a wrong data type is provided.

  • TypeError – If ‘csms’ parameter contains elements of mixed data type.

  • ValueError – If the provided ‘csms’ parameter contains no elements.

  • RuntimeError – If not all of the required information is present in the input data.

Examples

>>> from pyXLMS.exporter import to_xifdr
>>> from pyXLMS.parser import read
>>> pr = read("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_CSMs.xlsx", engine="MS Annika", crosslinker="DSS")
>>> csms = pr["crosslink-spectrum-matches"]
>>> to_xifdr(csms, filename="msannika_xiFDR.csv")
                                       run   scan          peptide1  ... peptide position 1  peptide position 2   score
0    XLpeplib_Beveridge_QEx-HFX_DSS_R1.raw   2257            GQKNSR  ...                777                 777  119.83
1    XLpeplib_Beveridge_QEx-HFX_DSS_R1.raw   2448            GQKNSR  ...                777                 693   13.91
2    XLpeplib_Beveridge_QEx-HFX_DSS_R1.raw   2561             SDKNR  ...                864                 864  114.43
3    XLpeplib_Beveridge_QEx-HFX_DSS_R1.raw   2719            DKQSGK  ...                676                 676  200.98
4    XLpeplib_Beveridge_QEx-HFX_DSS_R1.raw   2792            DKQSGK  ...                676                  45   94.47
..                                     ...    ...               ...  ...                ...                 ...     ...
821  XLpeplib_Beveridge_QEx-HFX_DSS_R1.raw  23297     MDGTEELLVKLNR  ...                387                 387  286.05
822  XLpeplib_Beveridge_QEx-HFX_DSS_R1.raw  23454  KIECFDSVEISGVEDR  ...                575                 682  376.15
823  XLpeplib_Beveridge_QEx-HFX_DSS_R1.raw  23581    SSFEKNPIDFLEAK  ...               1176                1176  412.44
824  XLpeplib_Beveridge_QEx-HFX_DSS_R1.raw  23683    SSFEKNPIDFLEAK  ...               1176                1176  437.10
825  XLpeplib_Beveridge_QEx-HFX_DSS_R1.raw  27087    MEDESKLHKFKDFK  ...                 99                1176   15.89
[826 rows x 14 columns]
>>> from pyXLMS.exporter import to_xifdr
>>> from pyXLMS.parser import read
>>> pr = read("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_CSMs.xlsx", engine="MS Annika", crosslinker="DSS")
>>> csms = pr["crosslink-spectrum-matches"]
>>> df = to_xifdr(csms, filename=None)

pyXLMS.exporter_to_xinet module#

pyXLMS.exporter_to_xinet.to_xinet(
crosslinks: List[Dict[str, Any]],
filename: str | None,
) DataFrame[source]#

Exports a list of crosslinks to xiNET format.

Exports a list of crosslinks to xiNET format. The tool xiNET is accessible via the link crosslinkviewer.org. Requires that alpha_proteins, beta_proteins, alpha_proteins_crosslink_positions and beta_proteins_crosslink_positions fields are set for all crosslinks.

Parameters:
  • crosslinks (list of dict of str, any) – A list of crosslinks.

  • filename (str, or None) – If not None, the exported data will be written to a file with the specified filename.

Returns:

A pandas DataFrame containing crosslinks in xiNET format.

Return type:

pd.DataFrame

Raises:
  • TypeError – If a wrong data type is provided.

  • TypeError – If ‘crosslinks’ parameter contains elements of mixed data type.

  • ValueError – If the provided ‘crosslinks’ parameter contains no elements.

  • RuntimeError – If not all of the required information is present in the input data.

Notes

The optional Score column in the xiNET table will only be available if all crosslinks have assigned scores.

Examples

>>> from pyXLMS.exporter import to_xinet
>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import targets_only
>>> from pyXLMS.transform import filter_proteins
>>> pr = read("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_Crosslinks.xlsx", engine="MS Annika", crosslinker="DSS")
>>> crosslinks = targets_only(pr)["crosslinks"]
>>> cas9 = filter_proteins(crosslinks, proteins=["Cas9"])["Both"]
>>> to_xinet(cas9, filename="crosslinks_xiNET.csv")
    Protein1 PepPos1           PepSeq1  LinkPos1 Protein2 PepPos2         PepSeq2  LinkPos2   Score   Id
0       Cas9     777            GQKNSR         3     Cas9     777          GQKNSR         3  119.83    1
1       Cas9     864             SDKNR         3     Cas9     864           SDKNR         3  114.43    2
2       Cas9     676            DKQSGK         2     Cas9     676          DKQSGK         2  200.98    3
3       Cas9     676            DKQSGK         2     Cas9      45           HSIKK         4   94.47    4
4       Cas9      31             VPSKK         4     Cas9      31           VPSKK         4  110.48    5
..       ...     ...               ...       ...      ...     ...             ...       ...     ...  ...
248     Cas9     387     MDGTEELLVKLNR        10     Cas9     387   MDGTEELLVKLNR        10  305.63  249
249     Cas9     682    TILDFLKSDGFANR         7     Cas9     947       YDENDKLIR         6  110.46  250
250     Cas9     788    IEEGIKELGSQILK         6     Cas9    1176  SSFEKNPIDFLEAK         5  288.36  251
251     Cas9     575  KIECFDSVEISGVEDR         1     Cas9     682  TILDFLKSDGFANR         7  376.15  252
252     Cas9    1176    SSFEKNPIDFLEAK         5     Cas9    1176  SSFEKNPIDFLEAK         5  437.10  253
[253 rows x 10 columns]
>>> from pyXLMS.exporter import to_xinet
>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import targets_only
>>> from pyXLMS.transform import filter_proteins
>>> pr = read("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_Crosslinks.xlsx", engine="MS Annika", crosslinker="DSS")
>>> crosslinks = targets_only(pr)["crosslinks"]
>>> cas9 = filter_proteins(crosslinks, proteins=["Cas9"])["Both"]
>>> df = to_xinet(cas9, filename=None)

pyXLMS.exporter_to_xiview module#

pyXLMS.exporter_to_xiview.to_xiview(
crosslinks: List[Dict[str, Any]],
filename: str | None,
minimal: bool = True,
) DataFrame[source]#

Exports a list of crosslinks to xiVIEW format.

Exports a list of crosslinks to xiVIEW format. The tool xiVIEW is accessible via the link xiview.org/. Requires that alpha_proteins, beta_proteins, alpha_proteins_crosslink_positions and beta_proteins_crosslink_positions fields are set for all crosslinks.

Parameters:
  • crosslinks (list of dict of str, any) – A list of crosslinks.

  • filename (str, or None) – If not None, the exported data will be written to a file with the specified filename.

  • minimal (bool, default = True) – Which xiVIEW format to return. If minimal = True, the minimal xiVIEW format is returned; otherwise the “CSV without peak lists” format is returned (internally this just calls exporter.to_xinet()). For more information on the xiVIEW formats please refer to the xiVIEW specification.

Returns:

A pandas DataFrame containing crosslinks in xiVIEW format.

Return type:

pd.DataFrame

Raises:
  • TypeError – If a wrong data type is provided.

  • TypeError – If ‘crosslinks’ parameter contains elements of mixed data type.

  • ValueError – If the provided ‘crosslinks’ parameter contains no elements.

  • RuntimeError – If not all of the required information is present in the input data.

Notes

The optional Score column in the xiVIEW table will only be available if all crosslinks have assigned scores. Likewise, the optional Decoy* columns will only be available if all crosslinks have assigned target and decoy labels.
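The rule for the optional columns can be sketched as follows. The dictionary keys score, alpha_decoy and beta_decoy are assumptions about how crosslinks store this information, and the helper optional_xiview_columns is hypothetical; the column names Score, Decoy1 and Decoy2 are taken from the example output of the minimal format.

```python
from typing import Any, Dict, List


def optional_xiview_columns(crosslinks: List[Dict[str, Any]]) -> List[str]:
    """Return the optional columns that every crosslink can fill."""
    columns = []
    if not crosslinks:
        return columns
    # Decoy labels are only usable if both peptides of every crosslink have one.
    if all(
        xl.get("alpha_decoy") is not None and xl.get("beta_decoy") is not None
        for xl in crosslinks
    ):
        columns += ["Decoy1", "Decoy2"]
    # A single crosslink without a score removes the Score column for all.
    if all(xl.get("score") is not None for xl in crosslinks):
        columns.append("Score")
    return columns


# Toy crosslinks for illustration only.
scored = {"score": 119.83, "alpha_decoy": False, "beta_decoy": False}
unscored = {"score": None, "alpha_decoy": False, "beta_decoy": False}
print(optional_xiview_columns([scored, scored]))
print(optional_xiview_columns([scored, unscored]))
```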

Examples

>>> from pyXLMS.exporter import to_xiview
>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import targets_only
>>> from pyXLMS.transform import filter_proteins
>>> pr = read("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_Crosslinks.xlsx", engine="MS Annika", crosslinker="DSS")
>>> crosslinks = targets_only(pr)["crosslinks"]
>>> cas9 = filter_proteins(crosslinks, proteins=["Cas9"])["Both"]
>>> to_xiview(cas9, filename="crosslinks_xiVIEW.csv")
    AbsPos1 AbsPos2 Protein1 Protein2 Decoy1 Decoy2   Score
0       779     779     Cas9     Cas9  FALSE  FALSE  119.83
1       866     866     Cas9     Cas9  FALSE  FALSE  114.43
2       677     677     Cas9     Cas9  FALSE  FALSE  200.98
3       677      48     Cas9     Cas9  FALSE  FALSE   94.47
4        34      34     Cas9     Cas9  FALSE  FALSE  110.48
..      ...     ...      ...      ...    ...    ...     ...
248     396     396     Cas9     Cas9  FALSE  FALSE  305.63
249     688     952     Cas9     Cas9  FALSE  FALSE  110.46
250     793    1180     Cas9     Cas9  FALSE  FALSE  288.36
251     575     688     Cas9     Cas9  FALSE  FALSE  376.15
252    1180    1180     Cas9     Cas9  FALSE  FALSE  437.10
[253 rows x 7 columns]
>>> from pyXLMS.exporter import to_xiview
>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import targets_only
>>> from pyXLMS.transform import filter_proteins
>>> pr = read("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_Crosslinks.xlsx", engine="MS Annika", crosslinker="DSS")
>>> crosslinks = targets_only(pr)["crosslinks"]
>>> cas9 = filter_proteins(crosslinks, proteins=["Cas9"])["Both"]
>>> df = to_xiview(cas9, filename=None)
>>> from pyXLMS.exporter import to_xiview
>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import targets_only
>>> from pyXLMS.transform import filter_proteins
>>> pr = read("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_Crosslinks.xlsx", engine="MS Annika", crosslinker="DSS")
>>> crosslinks = targets_only(pr)["crosslinks"]
>>> cas9 = filter_proteins(crosslinks, proteins=["Cas9"])["Both"]
>>> to_xiview(cas9, filename="crosslinks_xiVIEW.csv", minimal=False)
    Protein1 PepPos1           PepSeq1  LinkPos1 Protein2 PepPos2         PepSeq2  LinkPos2   Score   Id
0       Cas9     777            GQKNSR         3     Cas9     777          GQKNSR         3  119.83    1
1       Cas9     864             SDKNR         3     Cas9     864           SDKNR         3  114.43    2
2       Cas9     676            DKQSGK         2     Cas9     676          DKQSGK         2  200.98    3
3       Cas9     676            DKQSGK         2     Cas9      45           HSIKK         4   94.47    4
4       Cas9      31             VPSKK         4     Cas9      31           VPSKK         4  110.48    5
..       ...     ...               ...       ...      ...     ...             ...       ...     ...  ...
248     Cas9     387     MDGTEELLVKLNR        10     Cas9     387   MDGTEELLVKLNR        10  305.63  249
249     Cas9     682    TILDFLKSDGFANR         7     Cas9     947       YDENDKLIR         6  110.46  250
250     Cas9     788    IEEGIKELGSQILK         6     Cas9    1176  SSFEKNPIDFLEAK         5  288.36  251
251     Cas9     575  KIECFDSVEISGVEDR         1     Cas9     682  TILDFLKSDGFANR         7  376.15  252
252     Cas9    1176    SSFEKNPIDFLEAK         5     Cas9    1176  SSFEKNPIDFLEAK         5  437.10  253
[253 rows x 10 columns]

pyXLMS.exporter_to_xlinkdb module#

pyXLMS.exporter_to_xlinkdb.to_xlinkdb(
crosslinks: List[Dict[str, Any]],
filename: str | None,
) DataFrame[source]#

Exports a list of crosslinks to XlinkDB format.

Exports a list of crosslinks to XlinkDB format. The tool XlinkDB is accessible via the link xlinkdb.gs.washington.edu/xlinkdb. Requires that alpha_proteins and beta_proteins fields are set for all crosslinks.

Parameters:
  • crosslinks (list of dict of str, any) – A list of crosslinks.

  • filename (str, or None) – If not None, the exported data will be written to a file with the specified filename. The filename should not contain a file extension and should consist only of alpha-numeric characters (a-z, A-Z, 0-9).
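The filename restriction can be checked up front with a regular expression. This is a sketch only: the helper is_valid_xlinkdb_filename is hypothetical, and it assumes that "alpha-numeric" means ASCII letters and digits, which is why str.isalnum() (which also accepts non-ASCII letters) is not used.

```python
import re
from typing import Optional

# Assumption: only ASCII letters and digits are allowed, and at least one character.
_ALPHANUMERIC = re.compile(r"[A-Za-z0-9]+")


def is_valid_xlinkdb_filename(filename: Optional[str]) -> bool:
    """Check a filename against the restriction described above."""
    if filename is None:
        # None means that no file is written, so there is nothing to validate.
        return True
    return _ALPHANUMERIC.fullmatch(filename) is not None


print(is_valid_xlinkdb_filename("crosslinksForXlinkDB"))
print(is_valid_xlinkdb_filename("crosslinks.csv"))
```

A name such as "crosslinks.csv" is rejected in this sketch because the dot of the file extension is not alpha-numeric.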

Returns:

A pandas DataFrame containing crosslinks in XlinkDB format.

Return type:

pd.DataFrame

Raises:
  • TypeError – If a wrong data type is provided.

  • TypeError – If ‘crosslinks’ parameter contains elements of mixed data type.

  • ValueError – If the filename contains any non-alpha-numeric characters.

  • ValueError – If the provided ‘crosslinks’ parameter contains no elements.

  • RuntimeError – If not all of the required information is present in the input data.

Notes

XlinkDB input format requires a column with probabilities that the crosslinks are correct. Since that is not available from most crosslink search engines, this is simply set to a constant 1.

Examples

>>> from pyXLMS.exporter import to_xlinkdb
>>> from pyXLMS.parser import read
>>> pr = read("data/xi/1perc_xl_boost_Links_xiFDR2.2.1.csv", engine="xiSearch/xiFDR", crosslinker="DSS")
>>> crosslinks = pr["crosslinks"]
>>> to_xlinkdb(crosslinks, filename="crosslinksForXlinkDB")
               Peptide A Protein A  Labeled Position A      Peptide B Protein B  Labeled Position B  Probability
0            VVDELVKVMGR      Cas9                   6    VVDELVKVMGR      Cas9                   6            1
1    MLASAGELQKGNELALPSK      Cas9                   9    VVDELVKVMGR      Cas9                   6            1
2          MDGTEELLVKLNR      Cas9                   9  MDGTEELLVKLNR      Cas9                   9            1
3           MTNFDKNLPNEK      Cas9                   5       SKLVSDFR      Cas9                   1            1
4               DFQFYKVR      Cas9                   5    MIAKSEQEIGK      Cas9                   3            1
..                   ...       ...                 ...            ...       ...                 ...          ...
222        LPKYSLFELENGR      Cas9                   2          SDKNR      Cas9                   2            1
223               DKQSGK      Cas9                   1         DKQSGK      Cas9                   1            1
224               AGFIKR      Cas9                   4   SDNVPSEEVVKK      Cas9                  10            1
225                EKIEK      Cas9                   1          KVTVK      Cas9                   0            1
226                LSKSR      Cas9                   2          LSKSR      Cas9                   2            1
[227 rows x 7 columns]
>>> from pyXLMS.exporter import to_xlinkdb
>>> from pyXLMS.parser import read
>>> pr = read("data/xi/1perc_xl_boost_Links_xiFDR2.2.1.csv", engine="xiSearch/xiFDR", crosslinker="DSS")
>>> crosslinks = pr["crosslinks"]
>>> df = to_xlinkdb(crosslinks, filename=None)

pyXLMS.exporter_to_xlmstools module#

pyXLMS.exporter_to_xlmstools.to_xlmstools(
crosslinks: List[Dict[str, Any]],
pdb_file: str | BinaryIO,
gap_open: int | float = -10.0,
gap_extension: int | float = -1.0,
min_sequence_identity: float = 0.8,
allow_site_mismatch: bool = False,
ignore_chains: List[str] = [],
filename_prefix: str | None = None,
) Dict[str, Any][source]#

Exports a list of crosslinks to xlms-tools format.

Exports a list of crosslinks to xlms-tools format for protein structure analysis. The python package xlms-tools is available from gitlab.com/topf-lab/xlms-tools. This exporter performs basic local sequence alignment to align crosslinked peptides to a protein structure in PDB format. Gap open and gap extension penalties can be chosen, as well as a threshold for sequence identity that must be satisfied for a match to be reported. Additionally, the alignment is checked to see whether the supposedly crosslinked residue can be modified with a crosslinker in the protein structure. Due to alignment shifts the amino acid at the crosslink position might change, so that a crosslink is reported at a position that is not able to react with the crosslinker. Optionally, these positions can still be reported.

Parameters:
  • crosslinks (list of dict of str, any) – A list of crosslinks.

  • pdb_file (str, or file stream) – The name/path of the PDB file or a file-like object/stream. If a string is provided but no file is found locally, it’s assumed to be an identifier and the file is fetched from the PDB.

  • gap_open (int, or float, default = -10.0) – Gap open penalty for sequence alignment.

  • gap_extension (int, or float, default = -1.0) – Gap extension penalty for sequence alignment.

  • min_sequence_identity (float, default = 0.8) – Minimum sequence identity to consider an aligned crosslinked peptide a match with its corresponding position in the protein structure. Should be given as a fraction between 0 and 1, e.g. the default of 0.8 corresponds to a minimum of 80% sequence identity.

  • allow_site_mismatch (bool, default = False) – Whether the position should still be reported if the crosslink position after alignment is not a reactive amino acid in the protein structure. By default such cases are not reported.

  • ignore_chains (list of str, default = empty list) – A list of chains to ignore in the protein structure.

  • filename_prefix (str, or None, default = None) – If not None, the exported data will be written to files with the specified filename prefix. The full list of written files can be accessed via the returned dictionary.

Returns:

A dictionary with the following keys:

  • xlms-tools – The formatted text for xlms-tools.

  • xlms-tools DataFrame – The information from xlms-tools, but as a pandas DataFrame.

  • Number of mapped crosslinks – The total number of mapped crosslinks.

  • Mapping – A string that logs how crosslinks were mapped to the protein structure.

  • Parsed PDB sequence – The protein sequence that was parsed from the PDB file.

  • Parsed PDB chains – The parsed chains from the PDB file.

  • Parsed PDB residue numbers – The parsed residue numbers from the PDB file.

  • Exported files – A list of filenames of all files that were written to disk.

Return type:

dict of str, any

Raises:
  • TypeError – If a wrong data type is provided.

  • TypeError – If data contains elements of mixed data type.

  • ValueError – If parameter min_sequence_identity is out of bounds.

  • ValueError – If the provided data contains no elements.

Notes

Internally this exporter just calls exporter.to_pyxlinkviewer() and re-writes some of the files since the two tools share the same input file structure.

Examples

>>> from pyXLMS.exporter import to_xlmstools
>>> from pyXLMS.parser import read_custom
>>> pr = read_custom("data/_test/exporter/xlms-tools/unique_links_all_pyxlms.csv")
>>> crosslinks = pr["crosslinks"]
>>> xlmstools_result = to_xlmstools(crosslinks, pdb_file="6YHU", filename_prefix="6YHU")
>>> xlmstools_output_file_str = xlmstools_result["xlms-tools"]
>>> xlmstools_dataframe = xlmstools_result["xlms-tools DataFrame"]
>>> nr_mapped_crosslinks = xlmstools_result["Number of mapped crosslinks"]
>>> crosslink_mapping = xlmstools_result["Mapping"]
>>> parsed_pdb_sequence = xlmstools_result["Parsed PDB sequence"]
>>> parsed_pdb_chains = xlmstools_result["Parsed PDB chains"]
>>> parsed_pdb_residue_numbers = xlmstools_result["Parsed PDB residue numbers"]
>>> exported_files = xlmstools_result["Exported files"]

pyXLMS.exporter_to_xmas module#

pyXLMS.exporter_to_xmas.to_xmas(
crosslinks: List[Dict[str, Any]],
filename: str | None,
) DataFrame[source]#

Exports a list of crosslinks to XMAS format.

Exports a list of crosslinks to XMAS format for visualization in ChimeraX. The tool XMAS is available from github.com/ScheltemaLab/ChimeraX_XMAS_bundle.

Parameters:
  • crosslinks (list of dict of str, any) – A list of crosslinks.

  • filename (str, or None) – If not None, the exported data will be written to a file with the specified filename.

Returns:

A pandas DataFrame containing crosslinks in XMAS format.

Return type:

pd.DataFrame

Raises:
  • TypeError – If a wrong data type is provided.

  • TypeError – If ‘crosslinks’ parameter contains elements of mixed data type.

  • ValueError – If the provided ‘crosslinks’ parameter contains no elements.

Examples

>>> from pyXLMS.exporter import to_xmas
>>> from pyXLMS.data import create_crosslink_min
>>> xl1 = create_crosslink_min("KPEPTIDE", 1, "PKEPTIDE", 2)
>>> xl2 = create_crosslink_min("PEKPTIDE", 3, "PEPKTIDE", 4)
>>> crosslinks = [xl1, xl2]
>>> to_xmas(crosslinks, filename="crosslinks_xmas.xlsx")
   Sequence A  Sequence B
0  [K]PEPTIDE  P[K]EPTIDE
1  PE[K]PTIDE  PEP[K]TIDE
>>> from pyXLMS.exporter import to_xmas
>>> from pyXLMS.data import create_crosslink_min
>>> xl1 = create_crosslink_min("KPEPTIDE", 1, "PKEPTIDE", 2)
>>> xl2 = create_crosslink_min("PEKPTIDE", 3, "PEPKTIDE", 4)
>>> crosslinks = [xl1, xl2]
>>> to_xmas(crosslinks, filename=None)
   Sequence A  Sequence B
0  [K]PEPTIDE  P[K]EPTIDE
1  PE[K]PTIDE  PEP[K]TIDE

pyXLMS.exporter_util module#

pyXLMS.parser module#

pyXLMS.parser.read(
files: str | List[str] | BinaryIO,
engine: Literal['Custom', 'MaxQuant', 'MaxLynx', 'MS Annika', 'mzIdentML', 'pLink', 'Scout', 'xiSearch/xiFDR', 'XlinkX'],
crosslinker: str,
parse_modifications: bool = True,
ignore_errors: bool = False,
verbose: Literal[0, 1, 2] = 1,
**kwargs,
) Dict[str, Any][source]#

Read a crosslink result file.

Reads a crosslink or crosslink-spectrum-match result file from any of the supported crosslink search engines or formats. Currently supports result files from MaxLynx/MaxQuant, MS Annika, pLink 2 and pLink 3, Scout, xiSearch and xiFDR, XlinkX, and the mzIdentML format. Additionally supports parsing from custom .csv files in pyXLMS format; for more about the custom format see parser.read_custom() and the docs.

Parameters:
  • files (str, list of str, or file stream) – The name/path of the result file(s) or a file-like object/stream.

  • engine ("Custom", "MaxQuant", "MaxLynx", "MS Annika", "mzIdentML", "pLink", "Scout", "xiSearch/xiFDR", or "XlinkX") – Crosslink search engine or format of the result file.

  • crosslinker (str) – Name of the used cross-linking reagent, for example “DSSO”.

  • parse_modifications (bool, default = True) – Whether or not post-translational-modifications should be parsed for crosslink-spectrum-matches. Requires correct specification of the ‘modifications’ parameter for every parser. Defaults are selected for every parser if ‘modifications’ is not passed via **kwargs.

  • ignore_errors (bool, default = False) – Ignore errors when mapping modifications. Used in parser.read_xi() and parser.read_xlinkx().

  • verbose (0, 1, or 2, default = 1) –

    • 0: All warnings are ignored.

    • 1: Warnings are printed to stdout.

    • 2: Warnings are treated as errors.

  • **kwargs – Any additional parameters will be passed to the specific parsers.
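The three verbose levels can be pictured with a small dispatcher. This is an illustration of the documented behaviour rather than the code used by the parsers: the helper report_warning is hypothetical, and raising a RuntimeError for level 2 is an assumption, since the exception type is not stated above.

```python
def report_warning(message: str, verbose: int = 1) -> None:
    """Handle a parser warning according to the verbosity level."""
    if verbose not in (0, 1, 2):
        raise ValueError("verbose must be 0, 1 or 2.")
    if verbose == 0:
        # Level 0: the warning is ignored.
        return
    if verbose == 1:
        # Level 1: the warning is printed to stdout.
        print(f"Warning: {message}")
        return
    # Level 2: the warning is treated as an error (exception type is an assumption).
    raise RuntimeError(message)


report_warning("Could not parse modification.", verbose=0)
report_warning("Could not parse modification.", verbose=1)
```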

Returns:

The parser_result object containing all parsed information.

Return type:

dict

Raises:

ValueError – If the value entered for parameter engine is not supported.

Examples

>>> from pyXLMS.parser import read
>>> csms_from_xiSearch = read("data/xi/r1_Xi1.7.6.7.csv", engine="xiSearch/xiFDR", crosslinker="DSS")
>>> from pyXLMS.parser import read
>>> csms_from_MaxQuant = read("data/maxquant/run1/crosslinkMsms.txt", engine="MaxQuant", crosslinker="DSS")

pyXLMS.parser_util module#

pyXLMS.parser_util.format_sequence(
sequence: str,
remove_non_aa: bool = True,
remove_lower: bool = True,
) str[source]#

Formats the given amino acid sequence into a common representation.

The given amino acid sequence is re-formatted by converting all amino acids to upper case and optionally removing non-encoding and lower case characters.

Parameters:
  • sequence (str) – The amino acid sequence that should be formatted. Post-translational-modifications can be included in lower case but will be removed.

  • remove_non_aa (bool, default = True) – Whether or not to remove characters that do not encode amino acids.

  • remove_lower (bool, default = True) – Whether or not to remove lower case characters. This should be True if the amino acid sequence encodes post-translational-modifications in lower case.

Returns:

The formatted sequence.

Return type:

str

Examples

>>> from pyXLMS.parser_util import format_sequence
>>> format_sequence("PEP[K]TIDE")
'PEPKTIDE'
>>> from pyXLMS.parser_util import format_sequence
>>> format_sequence("PEPKdssoTIDE")
'PEPKTIDE'
>>> from pyXLMS.parser_util import format_sequence
>>> format_sequence("peptide", remove_lower = False)
'PEPTIDE'
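The behaviour shown in the three examples above can be approximated by the following sketch. It is a re-implementation written for illustration, not the source of format_sequence: the function name format_sequence_sketch and the order of the steps (drop lower case characters, convert to upper case, drop characters that are not in AMINO_ACIDS) are assumptions chosen so that the documented examples are reproduced.

```python
# One-letter codes as listed in pyXLMS.constants.AMINO_ACIDS.
AMINO_ACIDS = set("ACDEFGHIJKLMNOPQRSTUVWY")


def format_sequence_sketch(
    sequence: str,
    remove_non_aa: bool = True,
    remove_lower: bool = True,
) -> str:
    """Approximate the documented behaviour of format_sequence."""
    if remove_lower:
        # Lower case characters are assumed to encode modifications.
        sequence = "".join(char for char in sequence if not char.islower())
    sequence = sequence.upper()
    if remove_non_aa:
        # Drop anything that is not a valid one-letter amino acid code.
        sequence = "".join(char for char in sequence if char in AMINO_ACIDS)
    return sequence


print(format_sequence_sketch("PEP[K]TIDE"))
print(format_sequence_sketch("PEPKdssoTIDE"))
print(format_sequence_sketch("peptide", remove_lower=False))
```

Note that with the default remove_lower = True an all lower case input such as "peptide" would become an empty string in this sketch, which is why the third example passes remove_lower = False.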
pyXLMS.parser_util.get_bool_from_value(value: Any) bool[source]#

Parse a bool value from the given input.

Tries to parse a boolean value from the given input object. If the object is an instance of bool, the object itself is returned. If it is an instance of int, True is returned if the object is 1 and False if the object is 0; any other number raises a ValueError. If the object is an instance of str, True is returned if its lower case version contains the letter t, and False otherwise. If the object is none of these types, a ValueError is raised.

Parameters:

value (Any) – The value to parse from.

Returns:

The parsed boolean value.

Return type:

bool

Raises:

ValueError – If the object could not be parsed to bool.

Examples

>>> from pyXLMS.parser_util import get_bool_from_value
>>> get_bool_from_value(0)
False
>>> from pyXLMS.parser_util import get_bool_from_value
>>> get_bool_from_value("T")
True
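The parsing rules described for get_bool_from_value can be sketched as below. This is a hypothetical re-implementation (get_bool_sketch) derived only from the description, not the library source. One detail worth noting is that bool is a subclass of int in Python, so the bool check has to come first.

```python
from typing import Any


def get_bool_sketch(value: Any) -> bool:
    """Parse a bool from a bool, an int (0 or 1) or a str, as described above."""
    # bool must be tested before int because isinstance(True, int) is True.
    if isinstance(value, bool):
        return value
    if isinstance(value, int):
        if value in (0, 1):
            return value == 1
        raise ValueError(f"Cannot parse bool from integer {value}.")
    if isinstance(value, str):
        # Any string containing the letter "t" (in either case) counts as True.
        return "t" in value.lower()
    raise ValueError(f"Cannot parse bool from type {type(value).__name__}.")


print(get_bool_sketch(0))
print(get_bool_sketch("T"))
print(get_bool_sketch("False"))
```

Because the string rule only looks for the letter t, values such as "true", "T" and "target" are all parsed as True, while "False" and "F" are parsed as False.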

pyXLMS.parser_xldbse_custom module#

pyXLMS.parser_xldbse_custom.pyxlms_modification_str_parser(
modifications: str,
) Dict[int, Tuple[str, float]][source]#

Parse a pyXLMS modification string.

Parses a pyXLMS modification string and returns the pyXLMS specific modification object, a dictionary that maps positions to their modifications.

Parameters:

modifications (str) – The pyXLMS modification string.

Returns:

The pyXLMS specific modification object, a dictionary that maps positions (1-based) to their respective modifications given as tuples of modification name and modification delta mass.

Return type:

dict of int, tuple

Raises:

RuntimeError – If multiple modifications on the same residue are parsed.

Examples

>>> from pyXLMS.parser import pyxlms_modification_str_parser
>>> modification_str = "(1:[DSS|138.06808])"
>>> pyxlms_modification_str_parser(modification_str)
{1: ('DSS', 138.06808)}
>>> from pyXLMS.parser import pyxlms_modification_str_parser
>>> modification_str = "(1:[DSS|138.06808]);(7:[Oxidation|15.994915])"
>>> pyxlms_modification_str_parser(modification_str)
{1: ('DSS', 138.06808), 7: ('Oxidation', 15.994915)}
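
To make the string format in the examples above concrete, here is a minimal regex-based sketch. It is an illustration only: sketch_parse_modification_str is a hypothetical name, and the real parser may validate its input differently.

```python
import re
from typing import Dict, Tuple

# One entry looks like "(7:[Oxidation|15.994915])"; entries are joined by ";".
_ENTRY = re.compile(r"\((\d+):\[([^|\]]+)\|([^\]]+)\]\)")


def sketch_parse_modification_str(modifications: str) -> Dict[int, Tuple[str, float]]:
    """Hypothetical sketch: map 1-based positions to (name, delta mass)."""
    parsed: Dict[int, Tuple[str, float]] = {}
    for position, name, mass in _ENTRY.findall(modifications):
        key = int(position)
        if key in parsed:
            # Mirrors the documented RuntimeError for duplicate positions.
            raise RuntimeError(f"Multiple modifications at position {key}.")
        parsed[key] = (name, float(mass))
    return parsed


print(sketch_parse_modification_str("(1:[DSS|138.06808]);(7:[Oxidation|15.994915])"))
```

A function with this signature (str in, dict of int to tuple out) is also the shape expected by the modification_parser parameter of read_custom().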
pyXLMS.parser_xldbse_custom.read_custom(
files: str | List[str] | BinaryIO,
column_mapping: Dict[str, str] | None = None,
parse_modifications: bool = True,
modification_parser: Callable[[str], Dict[int, Tuple[str, float]]] | None = None,
decoy_prefix: str = 'REV_',
format: Literal['auto', 'csv', 'txt', 'tsv', 'xlsx'] = 'auto',
sep: str = ',',
decimal: str = '.',
) Dict[str, Any][source]#

Read a custom or pyXLMS result file.

Reads a custom or pyXLMS crosslink-spectrum-matches result file or crosslink result file in .csv or .xlsx format, and returns a parser_result.

The minimum required columns for a crosslink-spectrum-matches result file are:

  • “Alpha Peptide”: The unmodified amino acid sequence of the first peptide.

  • “Alpha Peptide Crosslink Position”: The position of the crosslinker in the sequence of the first peptide (1-based).

  • “Beta Peptide”: The unmodified amino acid sequence of the second peptide.

  • “Beta Peptide Crosslink Position”: The position of the crosslinker in the sequence of the second peptide (1-based).

  • “Spectrum File”: Name of the spectrum file the crosslink-spectrum-match was identified in.

  • “Scan Nr”: The corresponding scan number of the crosslink-spectrum-match.

The minimum required columns for a crosslink result file are:

  • “Alpha Peptide”: The unmodified amino acid sequence of the first peptide.

  • “Alpha Peptide Crosslink Position”: The position of the crosslinker in the sequence of the first peptide (1-based).

  • “Beta Peptide”: The unmodified amino acid sequence of the second peptide.

  • “Beta Peptide Crosslink Position”: The position of the crosslinker in the sequence of the second peptide (1-based).

A full specification of columns that can be parsed can be found in the docs.

Parameters:
  • files (str, list of str, or file stream) – The name/path of the result file(s) or a file-like object/stream.

  • column_mapping (dict of str, str) – A dictionary that maps the result file columns to the required pyXLMS column names.

  • parse_modifications (bool, default = True) – Whether or not post-translational-modifications should be parsed for crosslink-spectrum-matches. Requires correct specification of the ‘modification_parser’ parameter.

  • modification_parser (callable, or None) – A function that parses modification strings and returns the pyXLMS specific modifications object. If None, the function pyxlms_modification_str_parser() is used. If no modification columns are given this parameter is ignored.

  • decoy_prefix (str, default = "REV_") – The prefix that indicates that a protein is from the decoy database.

  • format ("auto", "csv", "tsv", "txt", or "xlsx", default = "auto") – The format of the result file. "auto" is only available if the name/path to the result file is given.

  • sep (str, default = ",") – Separator used in the .csv or .tsv file. Parameter is ignored if the file is in .xlsx format.

  • decimal (str, default = ".") – Character to recognize as decimal point. Parameter is ignored if the file is in .xlsx format.

Returns:

The parser_result object containing all parsed information.

Return type:

dict

Raises:
  • ValueError – If the input format is not supported or cannot be inferred.

  • TypeError – If one of the values could not be parsed.

  • RuntimeError – If the file(s) could not be read or if the file(s) contain no crosslinks or crosslink-spectrum-matches.

Examples

>>> from pyXLMS.parser import read_custom
>>> csms_from_pyxlms = read_custom("data/pyxlms/csm.txt")
>>> from pyXLMS.parser import read_custom
>>> crosslinks_from_pyxlms = read_custom("data/pyxlms/xl.txt")
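
Before handing a custom file to read_custom(), it can help to check that the minimum required columns listed above are present. The following sketch uses only the standard library. The helper missing_columns and the example row are made up for illustration and are not part of pyXLMS.

```python
import csv
import io
from typing import List

# Column names copied from the minimum requirements listed in this reference.
CSM_COLUMNS = [
    "Alpha Peptide",
    "Alpha Peptide Crosslink Position",
    "Beta Peptide",
    "Beta Peptide Crosslink Position",
    "Spectrum File",
    "Scan Nr",
]
CROSSLINK_COLUMNS = CSM_COLUMNS[:4]


def missing_columns(header: List[str], required: List[str]) -> List[str]:
    """Hypothetical helper: return the required columns absent from header."""
    return [column for column in required if column not in header]


# Build a tiny in-memory CSV with a placeholder row (values are illustrative).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=CSM_COLUMNS)
writer.writeheader()
writer.writerow(dict(zip(CSM_COLUMNS, ["PEPKTIDE", 4, "KPEPTIDE", 1, "example.raw", 1])))

header = next(csv.reader(io.StringIO(buffer.getvalue())))
print(missing_columns(header, CSM_COLUMNS))  # []
```

If the file uses different column names, the column_mapping parameter is the documented way to map them onto the names above.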

pyXLMS.parser_xldbse_maxquant module#

pyXLMS.parser_xldbse_maxquant.parse_modifications_from_maxquant_sequence(
seq: str,
crosslink_position: int,
crosslinker: str,
crosslinker_mass: float,
modifications: Dict[str, float] = {'ADH': 138.09054635, 'Acetyl': 42.010565, 'BS3': 138.06808, 'Carbamidomethyl': 57.021464, 'DSBSO': 308.03883, 'DSBU': 196.08479231, 'DSS': 138.06808, 'DSSO': 158.00376, 'Oxidation': 15.994915, 'PhoX': 209.97181, 'Phospho': 79.966331},
) Dict[int, Tuple[str, float]][source]#

Parse post-translational-modifications from a MaxQuant peptide sequence.

Parses post-translational-modifications (PTMs) from a MaxQuant peptide sequence, for example “_VVDELVKVM(Oxidation (M))GR_”.

Parameters:
  • seq (str) – The MaxQuant sequence string.

  • crosslink_position (int) – Position of the crosslinker in the sequence (1-based).

  • crosslinker (str) – Name of the used cross-linking reagent, for example “DSSO”.

  • crosslinker_mass (float) – Monoisotopic delta mass of the crosslink modification.

  • modifications (dict of str, float, default = constants.MODIFICATIONS) – Mapping of modification names to modification masses.

Returns:

The pyXLMS specific modifications object, a dictionary that maps positions to their corresponding modifications and their monoisotopic masses.

Return type:

dict of int, tuple

Raises:
  • RuntimeError – If the sequence could not be parsed because it is not in MaxQuant format.

  • RuntimeError – If multiple modifications on the same residue are parsed.

  • KeyError – If an unknown modification is encountered.

Examples

>>> from pyXLMS.parser import parse_modifications_from_maxquant_sequence
>>> seq = "_VVDELVKVM(Oxidation (M))GR_"
>>> parse_modifications_from_maxquant_sequence(seq, 2, "DSS", 138.06808)
{2: ('DSS', 138.06808), 9: ('Oxidation', 15.994915)}
>>> from pyXLMS.parser import parse_modifications_from_maxquant_sequence
>>> seq = "_VVDELVKVM(Oxidation (M))GRM(Oxidation (M))_"
>>> parse_modifications_from_maxquant_sequence(seq, 2, "DSS", 138.06808)
{2: ('DSS', 138.06808), 9: ('Oxidation', 15.994915), 12: ('Oxidation', 15.994915)}
>>> from pyXLMS.parser import parse_modifications_from_maxquant_sequence
>>> seq = "_M(Oxidation (M))VVDELVKVM(Oxidation (M))GRM(Oxidation (M))_"
>>> parse_modifications_from_maxquant_sequence(seq, 2, "DSS", 138.06808)
{2: ('DSS', 138.06808), 1: ('Oxidation', 15.994915), 10: ('Oxidation', 15.994915), 13: ('Oxidation', 15.994915)}
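
The MaxQuant sequence notation used in the examples above can be walked character by character. The sketch below is an assumption-laden illustration, not the pyXLMS implementation: sketch_parse_maxquant is a hypothetical name, MODS is a small subset of the values shown for constants.MODIFICATIONS, and the modification name is assumed to be the text before the residue hint such as " (M)".

```python
from typing import Dict, Tuple

# Assumed subset of constants.MODIFICATIONS, values copied from this reference.
MODS = {"Oxidation": 15.994915, "Carbamidomethyl": 57.021464}


def sketch_parse_maxquant(
    seq: str,
    crosslink_position: int,
    crosslinker: str,
    crosslinker_mass: float,
    modifications: Dict[str, float] = MODS,
) -> Dict[int, Tuple[str, float]]:
    """Hypothetical sketch of parsing e.g. "_VVDELVKVM(Oxidation (M))GR_"."""
    if len(seq) < 2 or not (seq.startswith("_") and seq.endswith("_")):
        raise RuntimeError("Sequence is not in MaxQuant format.")
    result = {crosslink_position: (crosslinker, crosslinker_mass)}
    position = 0  # 1-based index of the most recently seen residue
    i, end = 1, len(seq) - 1
    while i < end:
        if seq[i] != "(":
            position += 1
            i += 1
            continue
        # Find the matching closing parenthesis, allowing one nested pair.
        depth, j = 1, i + 1
        while j < end and depth > 0:
            depth += {"(": 1, ")": -1}.get(seq[j], 0)
            j += 1
        if depth != 0:
            raise RuntimeError("Unbalanced parentheses in sequence.")
        name = seq[i + 1 : j - 1].split(" (")[0].strip()
        if position in result:
            raise RuntimeError(f"Multiple modifications at position {position}.")
        result[position] = (name, modifications[name])  # KeyError if unknown
        i = j
    return result


print(sketch_parse_maxquant("_VVDELVKVM(Oxidation (M))GR_", 2, "DSS", 138.06808))
```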
pyXLMS.parser_xldbse_maxquant.read_maxlynx(
files: str | List[str] | BinaryIO,
crosslinker: str,
crosslinker_mass: float | None = None,
decoy_prefix: str = 'REV__',
parse_modifications: bool = True,
modifications: Dict[str, float] = {'ADH': 138.09054635, 'Acetyl': 42.010565, 'BS3': 138.06808, 'Carbamidomethyl': 57.021464, 'DSBSO': 308.03883, 'DSBU': 196.08479231, 'DSS': 138.06808, 'DSSO': 158.00376, 'Oxidation': 15.994915, 'PhoX': 209.97181, 'Phospho': 79.966331},
sep: str = '\t',
decimal: str = '.',
) Dict[str, Any][source]#

Read a MaxLynx result file.

Reads a MaxLynx crosslink-spectrum-matches result file “crosslinkMsms.txt” in .txt (tab delimited) format and returns a parser_result. This is an alias for the MaxQuant reader.

Parameters:
  • files (str, list of str, or file stream) – The name/path of the MaxLynx result file(s) or a file-like object/stream.

  • crosslinker (str) – Name of the used cross-linking reagent, for example “DSSO”.

  • crosslinker_mass (float, or None, default = None) – Monoisotopic delta mass of the crosslink modification. If the crosslinker is defined in parameter “modifications” this can be omitted.

  • decoy_prefix (str, default = "REV__") – The prefix that indicates that a protein is from the decoy database.

  • parse_modifications (bool, default = True) – Whether or not post-translational-modifications should be parsed for crosslink-spectrum-matches. Requires correct specification of the ‘modifications’ parameter.

  • modifications (dict of str, float, default = constants.MODIFICATIONS) – Mapping of modification names to modification masses.

  • sep (str, default = "\t") – Separator used in the .txt file.

  • decimal (str, default = ".") – Character to recognize as decimal point.

Returns:

The parser_result object containing all parsed information.

Return type:

dict

Raises:
  • RuntimeError – If the file(s) could not be read or if the file(s) contain no crosslink-spectrum-matches.

  • KeyError – If the specified crosslinker could not be found/mapped.

Warning

MaxLynx/MaxQuant only reports a single protein crosslink position per peptide. For ambiguous peptides, only the crosslink position of the first matching protein is reported. All matching proteins can be retrieved via additional_information, but not their corresponding crosslink positions. For this reason it is recommended to use transform.reannotate_positions() to correctly annotate all crosslink positions for all peptides if that is important for downstream analysis.

Examples

>>> from pyXLMS.parser import read_maxlynx
>>> csms_from_xlsx = read_maxlynx("data/maxquant/run1/crosslinkMsms.txt")
pyXLMS.parser_xldbse_maxquant.read_maxquant(
files: str | List[str] | BinaryIO,
crosslinker: str,
crosslinker_mass: float | None = None,
decoy_prefix: str = 'REV__',
parse_modifications: bool = True,
modifications: Dict[str, float] = {'ADH': 138.09054635, 'Acetyl': 42.010565, 'BS3': 138.06808, 'Carbamidomethyl': 57.021464, 'DSBSO': 308.03883, 'DSBU': 196.08479231, 'DSS': 138.06808, 'DSSO': 158.00376, 'Oxidation': 15.994915, 'PhoX': 209.97181, 'Phospho': 79.966331},
sep: str = '\t',
decimal: str = '.',
) Dict[str, Any][source]#

Read a MaxQuant result file.

Reads a MaxQuant crosslink-spectrum-matches result file “crosslinkMsms.txt” in .txt (tab delimited) format and returns a parser_result.

Parameters:
  • files (str, list of str, or file stream) – The name/path of the MaxQuant result file(s) or a file-like object/stream.

  • crosslinker (str) – Name of the used cross-linking reagent, for example “DSSO”.

  • crosslinker_mass (float, or None, default = None) – Monoisotopic delta mass of the crosslink modification. If the crosslinker is defined in parameter “modifications” this can be omitted.

  • decoy_prefix (str, default = "REV__") – The prefix that indicates that a protein is from the decoy database.

  • parse_modifications (bool, default = True) – Whether or not post-translational-modifications should be parsed for crosslink-spectrum-matches. Requires correct specification of the ‘modifications’ parameter.

  • modifications (dict of str, float, default = constants.MODIFICATIONS) – Mapping of modification names to modification masses.

  • sep (str, default = "\t") – Separator used in the .txt file.

  • decimal (str, default = ".") – Character to recognize as decimal point.

Returns:

The parser_result object containing all parsed information.

Return type:

dict

Raises:
  • RuntimeError – If the file(s) could not be read or if the file(s) contain no crosslink-spectrum-matches.

  • KeyError – If the specified crosslinker could not be found/mapped.

Warning

MaxLynx/MaxQuant only reports a single protein crosslink position per peptide. For ambiguous peptides, only the crosslink position of the first matching protein is reported. All matching proteins can be retrieved via additional_information, but not their corresponding crosslink positions. For this reason it is recommended to use transform.reannotate_positions() to correctly annotate all crosslink positions for all peptides if that is important for downstream analysis.

Examples

>>> from pyXLMS.parser import read_maxquant
>>> csms = read_maxquant("data/maxquant/run1/crosslinkMsms.txt")
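
The relationship between the crosslinker, crosslinker_mass and modifications parameters can be illustrated as follows. This is a sketch under assumptions: the helper name is hypothetical, MODIFICATIONS is a subset of the documented defaults, and giving an explicit crosslinker_mass precedence over the dictionary is an assumption rather than documented behaviour.

```python
from typing import Dict, Optional

# Assumed subset of constants.MODIFICATIONS, values copied from this reference.
MODIFICATIONS = {"DSS": 138.06808, "DSSO": 158.00376, "Oxidation": 15.994915}


def sketch_resolve_crosslinker_mass(
    crosslinker: str,
    crosslinker_mass: Optional[float] = None,
    modifications: Dict[str, float] = MODIFICATIONS,
) -> float:
    """Hypothetical sketch of how an omitted crosslinker mass may be resolved."""
    if crosslinker_mass is not None:
        # Assumption: an explicitly given mass is used as-is.
        return crosslinker_mass
    if crosslinker in modifications:
        return modifications[crosslinker]
    # Mirrors the documented KeyError for crosslinkers that cannot be mapped.
    raise KeyError(f"Crosslinker {crosslinker} not found in modifications.")


print(sketch_resolve_crosslinker_mass("DSSO"))  # 158.00376
```

For a crosslinker that is not in the dictionary, the mass therefore has to be passed explicitly.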

pyXLMS.parser_xldbse_msannika module#

pyXLMS.parser_xldbse_msannika.read_msannika(
files: str | List[str] | BinaryIO,
parse_modifications: bool = True,
modifications: Dict[str, float] = {'ADH': 138.09054635, 'Acetyl': 42.010565, 'BS3': 138.06808, 'Carbamidomethyl': 57.021464, 'DSBSO': 308.03883, 'DSBU': 196.08479231, 'DSS': 138.06808, 'DSSO': 158.00376, 'Oxidation': 15.994915, 'PhoX': 209.97181, 'Phospho': 79.966331},
format: Literal['auto', 'csv', 'txt', 'tsv', 'xlsx', 'pdresult'] = 'auto',
sep: str = '\t',
decimal: str = '.',
unsafe: bool = False,
verbose: Literal[0, 1, 2] = 1,
) Dict[str, Any][source]#

Read an MS Annika result file.

Reads an MS Annika crosslink-spectrum-matches result file or crosslink result file in .csv or .xlsx format, or both from a .pdResult file from Proteome Discoverer, and returns a parser_result.

Parameters:
  • files (str, list of str, or file stream) – The name/path of the MS Annika result file(s) or a file-like object/stream.

  • parse_modifications (bool, default = True) – Whether or not post-translational-modifications should be parsed for crosslink-spectrum-matches. Requires correct specification of the ‘modifications’ parameter.

  • modifications (dict of str, float, default = constants.MODIFICATIONS) – Mapping of modification names to modification masses.

  • format ("auto", "csv", "tsv", "txt", "xlsx", or "pdresult", default = "auto") – The format of the result file. "auto" is only available if the name/path to the MS Annika result file is given.

  • sep (str, default = "\t") – Separator used in the .csv or .tsv file. Parameter is ignored if the file is in .xlsx or .pdResult format.

  • decimal (str, default = ".") – Character to recognize as decimal point. Parameter is ignored if the file is in .xlsx or .pdResult format.

  • unsafe (bool, default = False) – If True, allows reading of negative peptide and crosslink positions but replaces their values with None. Negative values occur when peptides can’t be matched to proteins because of ‘X’ in protein sequences. Reannotation might be possible with transform.reannotate_positions().

  • verbose (0, 1, or 2, default = 1) –

    • 0: All warnings are ignored.

    • 1: Warnings are printed to stdout.

    • 2: Warnings are treated as errors.

Returns:

The parser_result object containing all parsed information.

Return type:

dict

Raises:
  • ValueError – If the input format is not supported or cannot be inferred.

  • TypeError – If the pdResult file is provided in the wrong format.

  • TypeError – If parameter verbose was not set correctly.

  • RuntimeError – If one of the crosslinks or crosslink-spectrum-matches contains unknown crosslink or peptide positions. This occurs when peptides can’t be matched to proteins because of ‘X’ in protein sequences. Selecting ‘unsafe = True’ will ignore these errors and return None type positions. Reannotation might be possible with transform.reannotate_positions().

  • RuntimeError – If the file(s) could not be read or if the file(s) contain no crosslinks or crosslink-spectrum-matches.

  • KeyError – If one of the found post-translational-modifications could not be found/mapped.

Warning

MS Annika does not report if the individual peptides in a crosslink are from the target or decoy database. The parser assumes that both peptides from a target crosslink are from the target database, and vice versa, that both peptides are from the decoy database if it is a decoy crosslink. This leads to only TT and DD matches, which needs to be considered for FDR estimation. This also only applies to crosslinks and not crosslink-spectrum-matches, where this information is correctly reported and parsed.

Examples

>>> from pyXLMS.parser import read_msannika
>>> csms_from_xlsx = read_msannika("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_CSMs.xlsx")
>>> from pyXLMS.parser import read_msannika
>>> crosslinks_from_xlsx = read_msannika("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_Crosslinks.xlsx")
>>> from pyXLMS.parser import read_msannika
>>> csms_from_tsv = read_msannika("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_CSMs.txt")
>>> from pyXLMS.parser import read_msannika
>>> crosslinks_from_tsv = read_msannika("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_Crosslinks.txt")
>>> from pyXLMS.parser import read_msannika
>>> csms_and_crosslinks_from_pdresult = read_msannika("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1.pdResult")
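
The effect of the unsafe parameter on negative positions can be pictured with a small sketch. The function name is hypothetical and the code only restates the documented behaviour. It is not taken from the parser.

```python
from typing import Optional


def sketch_check_position(position: int, unsafe: bool = False) -> Optional[int]:
    """Hypothetical sketch of how negative positions are treated."""
    if position >= 0:
        return position
    if unsafe:
        # Documented behaviour: negative values are replaced with None.
        return None
    # Documented behaviour: unknown positions raise unless unsafe = True.
    raise RuntimeError("Unknown crosslink or peptide position encountered.")


print(sketch_check_position(12))               # 12
print(sketch_check_position(-1, unsafe=True))  # None
```

Positions that come back as None may be recoverable afterwards with transform.reannotate_positions(), as noted in the parameter description.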

pyXLMS.parser_xldbse_mzid module#

pyXLMS.parser_xldbse_mzid.parse_scan_nr_from_mzid(spectrum_id: str) int[source]#

Parse the scan number from a ‘spectrumID’ of a mzIdentML file.

Parameters:

spectrum_id (str) – The ‘spectrumID’ of the mass spectrum from an mzIdentML file read with pyteomics.

Returns:

The scan number.

Return type:

int

Examples

>>> from pyXLMS.parser import parse_scan_nr_from_mzid
>>> parse_scan_nr_from_mzid("scan=5321")
5321
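
If spectrum IDs in a given mzIdentML file follow a different convention, read_mzid() accepts a custom scan_nr_parser. A minimal sketch of such a callable is shown below. The function name and the regular expression are illustrative assumptions and may need adjusting to the actual ID format.

```python
import re


def sketch_scan_nr_parser(spectrum_id: str) -> int:
    """Hypothetical custom parser: extract the number after "scan="."""
    match = re.search(r"scan=(\d+)", spectrum_id)
    if match is None:
        raise ValueError(f"No scan number found in '{spectrum_id}'.")
    return int(match.group(1))


print(sketch_scan_nr_parser("scan=5321"))  # 5321
print(sketch_scan_nr_parser("controllerType=0 controllerNumber=1 scan=42"))  # 42
```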
pyXLMS.parser_xldbse_mzid.read_mzid(
files: str | List[str] | BinaryIO,
scan_nr_parser: Callable[[str], int] | None = None,
decoy: bool | None = None,
crosslinkers: Dict[str, float] = {'ADH': 138.09054635, 'BS3': 138.06808, 'DSBSO': 308.03883, 'DSBU': 196.08479231, 'DSS': 138.06808, 'DSSO': 158.00376, 'PhoX': 209.97181},
verbose: Literal[0, 1, 2] = 1,
) Dict[str, Any][source]#

Read a mzIdentML (mzid) file.

Reads crosslink-spectrum-matches from a mzIdentML (mzid) file and returns a parser_result.

Parameters:
  • files (str, list of str, or file stream) – The name/path of the mzIdentML (mzid) file(s) or a file-like object/stream.

  • scan_nr_parser (callable, or None, default = None) – A function that parses the scan number from mzid spectrumIDs. If None (default) the function parse_scan_nr_from_mzid() is used.

  • decoy (bool, or None, default = None) – Whether the mzid file contains decoy CSMs (True) or target CSMs (False).

  • crosslinkers (dict of str, float, default = constants.CROSSLINKERS) – Mapping of crosslinker names to crosslinker delta masses.

  • verbose (0, 1, or 2, default = 1) –

    • 0: All warnings are ignored.

    • 1: Warnings are printed to stdout.

    • 2: Warnings are treated as errors.

Returns:

The parser_result object containing all parsed information.

Return type:

dict

Raises:
  • RuntimeError – If the file(s) could not be read or if the file(s) contain no crosslink-spectrum-matches.

  • RuntimeError – If parser is used with verbose = 2.

  • RuntimeError – If there are warnings while reading the mzIdentML file (only for verbose = 2).

  • TypeError – If parameter verbose was not set correctly.

  • TypeError – If one of the values necessary to create a crosslink-spectrum-match could not be parsed correctly.

Notes

This parser is experimental, as I don’t know if the mzIdentML structure is consistent across different crosslink search engines. This parser was tested with mzIdentML files from MS Annika and XlinkX.

Warning

This parser only parses minimal data because most information is not available from the mzIdentML file. The available data is:

  • alpha_peptide

  • alpha_peptide_crosslink_position

  • beta_peptide

  • beta_peptide_crosslink_position

  • spectrum_file

  • scan_nr

Examples

>>> from pyXLMS.parser import read_mzid
>>> csms = read_mzid("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1.mzid")
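
The three verbose levels shared by several parsers can be summarised in a short sketch. This is an illustration of the documented semantics only. The function name is hypothetical and the exact warning text printed by pyXLMS is not reproduced here.

```python
def sketch_handle_warning(message: str, verbose: int = 1) -> None:
    """Hypothetical sketch of the documented verbose semantics."""
    if verbose not in (0, 1, 2):
        # Documented: TypeError if parameter verbose was not set correctly.
        raise TypeError("Parameter verbose must be 0, 1, or 2.")
    if verbose == 0:
        return  # warnings are ignored
    if verbose == 1:
        print(f"Warning: {message}")  # warnings go to stdout
        return
    # verbose == 2: warnings are treated as errors
    raise RuntimeError(message)


sketch_handle_warning("example warning", verbose=1)
```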

pyXLMS.parser_xldbse_scout module#

pyXLMS.parser_xldbse_scout.detect_scout_filetype(
data: DataFrame,
) Literal['scout_csms_unfiltered', 'scout_csms_filtered', 'scout_xl'][source]#

Detects the Scout-related source of the data.

Detects whether the input data is unfiltered crosslink-spectrum-matches, filtered crosslink-spectrum-matches, or crosslinks from Scout.

Parameters:

data (pd.DataFrame) – The input data originating from Scout.

Returns:

“scout_csms_unfiltered” if a Scout unfiltered CSMs file was read, “scout_csms_filtered” if a Scout filtered CSMs file was read, “scout_xl” if a Scout crosslink/residue pair result file was read.

Return type:

str

Raises:

ValueError – If the data source could not be determined.

Examples

>>> from pyXLMS.parser import detect_scout_filetype
>>> import pandas as pd
>>> df1 = pd.read_csv("data/scout/Cas9_Unfiltered_CSMs.csv")
>>> detect_scout_filetype(df1)
'scout_csms_unfiltered'
>>> from pyXLMS.parser import detect_scout_filetype
>>> import pandas as pd
>>> df2 = pd.read_csv("data/scout/Cas9_Filtered_CSMs.csv")
>>> detect_scout_filetype(df2)
'scout_csms_filtered'
>>> from pyXLMS.parser import detect_scout_filetype
>>> import pandas as pd
>>> df3 = pd.read_csv("data/scout/Cas9_Residue_Pairs.csv")
>>> detect_scout_filetype(df3)
'scout_xl'
pyXLMS.parser_xldbse_scout.parse_modifications_from_scout_sequence(
seq: str,
crosslink_position: int,
crosslinker: str,
crosslinker_mass: float,
modifications: Dict[str, Tuple[str, float]] = {'+15.994900': ('Oxidation', 15.994915), '+57.021460': ('Carbamidomethyl', 57.021464), 'ADH': ('ADH', 138.09054635), 'BS3': ('BS3', 138.06808), 'Carbamidomethyl': ('Carbamidomethyl', 57.021464), 'DSBSO': ('DSBSO', 308.03883), 'DSBU': ('DSBU', 196.08479231), 'DSS': ('DSS', 138.06808), 'DSSO': ('DSSO', 158.00376), 'Oxidation of Methionine': ('Oxidation', 15.994915), 'PhoX': ('PhoX', 209.97181)},
verbose: Literal[0, 1, 2] = 1,
) Dict[int, Tuple[str, float]][source]#

Parse post-translational-modifications from a Scout peptide sequence.

Parses post-translational-modifications (PTMs) from a Scout peptide sequence, for example “M(+15.994900)LASAGELQKGNELALPSK”.

Parameters:
  • seq (str) – The Scout sequence string.

  • crosslink_position (int) – Position of the crosslinker in the sequence (1-based).

  • crosslinker (str) – Name of the used cross-linking reagent, for example “DSSO”.

  • crosslinker_mass (float) – Monoisotopic delta mass of the crosslink modification.

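  • modifications (dict of str, tuple, default = constants.SCOUT_MODIFICATION_MAPPING) – Mapping of Scout sequence elements (e.g. "+15.994900") and modifications (e.g. "Oxidation of Methionine") to their modifications (e.g. ("Oxidation", 15.994915)).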

  • verbose (0, 1, or 2, default = 1) –

    • 0: All warnings are ignored.

    • 1: Warnings are printed to stdout.

    • 2: Warnings are treated as errors.

Returns:

The pyXLMS specific modifications object, a dictionary that maps positions to their corresponding modifications and their monoisotopic masses.

Return type:

dict of int, tuple

Raises:
  • RuntimeError – If multiple modifications on the same residue are parsed (only if verbose = 2).

  • KeyError – If an unknown modification is encountered.

Examples

>>> from pyXLMS.parser import parse_modifications_from_scout_sequence
>>> seq = "M(+15.994900)LASAGELQKGNELALPSK"
>>> parse_modifications_from_scout_sequence(seq, 10, "DSS", 138.06808)
{10: ('DSS', 138.06808), 1: ('Oxidation', 15.994915)}
>>> from pyXLMS.parser import parse_modifications_from_scout_sequence
>>> seq = "KIEC(+57.021460)FDSVEISGVEDR"
>>> parse_modifications_from_scout_sequence(seq, 1, "DSS", 138.06808)
{1: ('DSS', 138.06808), 4: ('Carbamidomethyl', 57.021464)}
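
The Scout notation in the examples above places each modification in parentheses directly after the modified residue. A minimal sketch of that idea follows. The function name is hypothetical, SCOUT_MAPPING is a subset of the documented default mapping, and unlike the documented function this sketch has no verbose parameter and always raises on duplicate positions.

```python
import re
from typing import Dict, Tuple

# Assumed subset of constants.SCOUT_MODIFICATION_MAPPING from this reference.
SCOUT_MAPPING = {
    "+15.994900": ("Oxidation", 15.994915),
    "+57.021460": ("Carbamidomethyl", 57.021464),
}

# Either a parenthesised annotation or a single upper-case residue.
_TOKEN = re.compile(r"\(([^)]*)\)|([A-Z])")


def sketch_parse_scout(
    seq: str,
    crosslink_position: int,
    crosslinker: str,
    crosslinker_mass: float,
    modifications: Dict[str, Tuple[str, float]] = SCOUT_MAPPING,
) -> Dict[int, Tuple[str, float]]:
    """Hypothetical sketch of parsing e.g. "M(+15.994900)LASAGELQK"."""
    result = {crosslink_position: (crosslinker, crosslinker_mass)}
    position = 0  # 1-based index of the most recently seen residue
    for token in _TOKEN.finditer(seq):
        annotation, residue = token.group(1), token.group(2)
        if residue is not None:
            position += 1
            continue
        if position in result:
            raise RuntimeError(f"Multiple modifications at position {position}.")
        result[position] = modifications[annotation]  # KeyError if unknown
    return result


print(sketch_parse_scout("KIEC(+57.021460)FDSVEISGVEDR", 1, "DSS", 138.06808))
```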
pyXLMS.parser_xldbse_scout.read_scout(
files: str | List[str] | BinaryIO,
crosslinker: str,
crosslinker_mass: float | None = None,
parse_modifications: bool = True,
modifications: Dict[str, Tuple[str, float]] = {'+15.994900': ('Oxidation', 15.994915), '+57.021460': ('Carbamidomethyl', 57.021464), 'ADH': ('ADH', 138.09054635), 'BS3': ('BS3', 138.06808), 'Carbamidomethyl': ('Carbamidomethyl', 57.021464), 'DSBSO': ('DSBSO', 308.03883), 'DSBU': ('DSBU', 196.08479231), 'DSS': ('DSS', 138.06808), 'DSSO': ('DSSO', 158.00376), 'Oxidation of Methionine': ('Oxidation', 15.994915), 'PhoX': ('PhoX', 209.97181)},
sep: str = ',',
decimal: str = '.',
verbose: Literal[0, 1, 2] = 1,
) Dict[str, Any][source]#

Read a Scout result file.

Reads a Scout filtered or unfiltered crosslink-spectrum-matches result file or crosslink/residue pair result file in .csv format and returns a parser_result.

Parameters:
  • files (str, list of str, or file stream) – The name/path of the Scout result file(s) or a file-like object/stream.

  • crosslinker (str) – Name of the used cross-linking reagent, for example “DSSO”.

  • crosslinker_mass (float, or None, default = None) – Monoisotopic delta mass of the crosslink modification. If the crosslinker is defined in parameter “modifications” this can be omitted.

  • parse_modifications (bool, default = True) – Whether or not post-translational-modifications should be parsed for crosslink-spectrum-matches. Requires correct specification of the ‘modifications’ parameter.

  • modifications (dict of str, tuple, default = constants.SCOUT_MODIFICATION_MAPPING) – Mapping of Scout sequence elements (e.g. "+15.994900") and modifications (e.g. "Oxidation of Methionine") to their modifications (e.g. ("Oxidation", 15.994915)).

  • sep (str, default = ",") – Separator used in the .csv file.

  • decimal (str, default = ".") – Character to recognize as decimal point.

  • verbose (0, 1, or 2, default = 1) –

    • 0: All warnings are ignored.

    • 1: Warnings are printed to stdout.

    • 2: Warnings are treated as errors.

Returns:

The parser_result object containing all parsed information.

Return type:

dict

Raises:
  • RuntimeError – If the file(s) could not be read or if the file(s) contain no crosslinks or crosslink-spectrum-matches.

  • KeyError – If the specified crosslinker could not be found/mapped.

  • TypeError – If parameter verbose was not set correctly.

Warning

  • When reading unfiltered crosslink-spectrum-matches, no protein crosslink positions or protein peptide positions are available, as these are not reported. If needed they should be annotated with transform.reannotate_positions().

  • When reading filtered crosslink-spectrum-matches, Scout does not report if the individual peptides in a crosslink are from the target or decoy database. The parser assumes that both peptides from a target crosslink-spectrum-match are from the target database, and vice versa, that both peptides are from the decoy database if it is a decoy crosslink-spectrum-match. This leads to only TT and DD matches, which needs to be considered for FDR estimation.

  • When reading crosslinks / residue pairs, Scout does not report if the individual peptides in a crosslink are from the target or decoy database. The parser assumes that both peptides from a target crosslink are from the target database, and vice versa, that both peptides are from the decoy database if it is a decoy crosslink. This leads to only TT and DD matches, which needs to be considered for FDR estimation.

Examples

>>> from pyXLMS.parser import read_scout
>>> csms_unfiltered = read_scout("data/scout/Cas9_Unfiltered_CSMs.csv")
>>> from pyXLMS.parser import read_scout
>>> csms_filtered = read_scout("data/scout/Cas9_Filtered_CSMs.csv")
>>> from pyXLMS.parser import read_scout
>>> crosslinks = read_scout("data/scout/Cas9_Residue_Pairs.csv")

pyXLMS.parser_xldbse_xi module#

pyXLMS.parser_xldbse_xi.detect_xi_filetype(
data: DataFrame,
) Literal['xisearch', 'xifdr_csms', 'xifdr_crosslinks'][source]#

Detects the xi-related source (application) of the data.

Detects whether the input data is originating from xiSearch or xiFDR, and if xiFDR which type of data is being read (crosslink-spectrum-matches or crosslinks).

Parameters:

data (pd.DataFrame) – The input data originating from xiSearch or xiFDR.

Returns:

“xisearch” if a xiSearch result file was read, “xifdr_csms” if CSMs from xiFDR were read, “xifdr_crosslinks” if crosslinks from xiFDR were read.

Return type:

str

Raises:

ValueError – If the data source could not be determined.

Examples

>>> from pyXLMS.parser import detect_xi_filetype
>>> import pandas as pd
>>> df1 = pd.read_csv("data/xi/r1_Xi1.7.6.7.csv")
>>> detect_xi_filetype(df1)
'xisearch'
>>> from pyXLMS.parser import detect_xi_filetype
>>> import pandas as pd
>>> df2 = pd.read_csv("data/xi/1perc_xl_boost_CSM_xiFDR2.2.1.csv")
>>> detect_xi_filetype(df2)
'xifdr_csms'
>>> from pyXLMS.parser import detect_xi_filetype
>>> import pandas as pd
>>> df3 = pd.read_csv("data/xi/1perc_xl_boost_Links_xiFDR2.2.1.csv")
>>> detect_xi_filetype(df3)
'xifdr_crosslinks'
pyXLMS.parser_xldbse_xi.parse_modifications_from_xi_sequence(sequence: str) Dict[int, str][source]#

Parses all post-translational-modifications from a peptide sequence as reported by xiFDR.

Parses all post-translational-modifications from a peptide sequence as reported by xiFDR. This assumes that amino acids are given in upper case letters and post-translational-modifications in lower case letters. The parsed modifications are returned as a dictionary that maps their position in the sequence (1-based) to their xiFDR annotation (SYMBOLEXT), for example "cm" or "ox".

Parameters:

sequence (str) – The peptide sequence as given by xiFDR.

Returns:

Dictionary that maps positions in the peptide sequence (1-based, keys) to their respective modifications (values). The modifications are given in xiFDR annotation style (SYMBOLEXT), which is the lower-case modification code, for example "cm" for carbamidomethylation.

Return type:

dict of int, str

Raises:

RuntimeError – If multiple modifications on the same residue are parsed.

Examples

>>> from pyXLMS.parser import parse_modifications_from_xi_sequence
>>> seq1 = "KIECcmFDSVEISGVEDR"
>>> parse_modifications_from_xi_sequence(seq1)
{4: 'cm'}
>>> from pyXLMS.parser import parse_modifications_from_xi_sequence
>>> seq2 = "KIECcmFDSVEMoxISGVEDR"
>>> parse_modifications_from_xi_sequence(seq2)
{4: 'cm', 10: 'ox'}
>>> from pyXLMS.parser import parse_modifications_from_xi_sequence
>>> seq3 = "KIECcmFDSVEISGVEDRMox"
>>> parse_modifications_from_xi_sequence(seq3)
{4: 'cm', 17: 'ox'}
>>> from pyXLMS.parser import parse_modifications_from_xi_sequence
>>> seq4 = "CcmKIECcmFDSVEISGVEDRMox"
>>> parse_modifications_from_xi_sequence(seq4)
{1: 'cm', 5: 'cm', 18: 'ox'}
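
The upper-case/lower-case convention described above lends itself to a single pass over the sequence. The sketch below illustrates the idea. The function name is hypothetical, and unlike the documented function it cannot detect multiple modifications on the same residue, because consecutive non-upper-case characters are simply joined into one code.

```python
from typing import Dict


def sketch_parse_xi_modifications(sequence: str) -> Dict[int, str]:
    """Hypothetical sketch: upper case = residue, anything else = modification."""
    modifications: Dict[int, str] = {}
    position = 0  # 1-based index of the most recently seen residue
    for char in sequence:
        if char.isupper():
            position += 1
            continue
        if position == 0:
            raise RuntimeError("Modification found before the first residue.")
        # Extend the modification code attached to the current residue.
        modifications[position] = modifications.get(position, "") + char
    return modifications


print(sketch_parse_xi_modifications("CcmKIECcmFDSVEISGVEDRMox"))
```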
pyXLMS.parser_xldbse_xi.parse_peptide(sequence: str, term_char: str = '.') str[source]#

Parses the peptide sequence from a sequence string including flanking amino acids.

Parses the peptide sequence from a sequence string including flanking amino acids, for example "K.KKMoxKLS.S". The returned peptide sequence for this example would be "KKMoxKLS".

Parameters:
  • sequence (str) – The sequence string containing the peptide sequence and flanking amino acids.

  • term_char (str (single character), default = ".") – The character used to denote N-terminal and C-terminal.

Returns:

The parsed peptide sequence without flanking amino acids.

Return type:

str

Raises:

RuntimeError – If (one of) the peptide sequence(s) could not be parsed.

Examples

>>> from pyXLMS.parser import parse_peptide
>>> parse_peptide("K.KKMoxKLS.S")
'KKMoxKLS'
>>> from pyXLMS.parser import parse_peptide
>>> parse_peptide("-.CcmCcmPSR.T")
'CcmCcmPSR'
>>> from pyXLMS.parser import parse_peptide
>>> parse_peptide("CCPSR")
'CCPSR'
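
The three examples above are consistent with a simple split on the terminal character. A sketch of that reading is given below. The function name is hypothetical, and treating any other number of parts as an error is an assumption.

```python
def sketch_parse_peptide(sequence: str, term_char: str = ".") -> str:
    """Hypothetical sketch: strip flanking residues such as "K." and ".S"."""
    parts = sequence.split(term_char)
    if len(parts) == 1:
        # No flanking amino acids given, the input already is the peptide.
        return parts[0]
    if len(parts) == 3:
        # Layout is <flanking>.<peptide>.<flanking>
        return parts[1]
    raise RuntimeError(f"Could not parse peptide from '{sequence}'.")


print(sketch_parse_peptide("K.KKMoxKLS.S"))  # KKMoxKLS
```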
pyXLMS.parser_xldbse_xi.read_xi(
files: str | List[str] | BinaryIO,
decoy_prefix: str | None = 'auto',
parse_modifications: bool = True,
modifications: Dict[str, Tuple[str, float]] = {'->': ('Substitution', nan), 'bs3_ami': ('BS3 Amidated', 155.094619105), 'bs3_hyd': ('BS3 Hydrolized', 156.0786347), 'bs3_tris': ('BS3 Tris', 259.141973), 'bs3loop': ('BS3 Looplink', 138.06808), 'bs3nh2': ('BS3 Amidated', 155.094619105), 'bs3oh': ('BS3 Hydrolized', 156.0786347), 'cm': ('Carbamidomethyl', 57.021464), 'dsbu_ami': ('DSBU Amidated', 213.111341), 'dsbu_hyd': ('DSBU Hydrolized', 214.095357), 'dsbu_loop': ('DSBU Looplink', 196.08479231), 'dsbu_tris': ('DSBU Tris', 317.158685), 'dsbuloop': ('DSBU Looplink', 196.08479231), 'dsso_ami': ('DSSO Amidated', 175.030313905), 'dsso_hyd': ('DSSO Hydrolized', 176.0143295), 'dsso_loop': ('DSSO Looplink', 158.00376), 'dsso_tris': ('DSSO Tris', 279.077658), 'dssoloop': ('DSSO Looplink', 158.00376), 'ox': ('Oxidation', 15.994915)},
sep: str = ',',
decimal: str = '.',
ignore_errors: bool = False,
verbose: Literal[0, 1, 2] = 1,
) Dict[str, Any][source]#

Read a xiSearch/xiFDR result file.

Reads a xiSearch crosslink-spectrum-matches result file or a xiFDR crosslink-spectrum-matches result file or crosslink result file in .csv format and returns a parser_result.

Parameters:
  • files (str, list of str, or file stream) – The name/path of the xiSearch/xiFDR result file(s) or a file-like object/stream.

  • decoy_prefix (str, or None, default = "auto") – The prefix that indicates that a protein is from the decoy database. If “auto” or None it will use the default for each xi file type.

  • parse_modifications (bool, default = True) – Whether or not post-translational-modifications should be parsed for crosslink-spectrum-matches. Requires correct specification of the ‘modifications’ parameter.

  • modifications (dict of str, tuple, default = constants.XI_MODIFICATION_MAPPING) – Mapping of xi sequence elements (e.g. "cm") to their modifications (e.g. ("Carbamidomethyl", 57.021464)). This corresponds to the SYMBOLEXT field, or the SYMBOL field minus the amino acid in the xiSearch config.

  • sep (str, default = ",") – Separator used in the .csv file.

  • decimal (str, default = ".") – Character to recognize as decimal point.

  • ignore_errors (bool, default = False) – Whether modifications that are not listed in the ‘modifications’ parameter should be tolerated instead of raising an error. By default an error is raised if an unknown modification is encountered. If True, unknown modifications are encoded with the xi shortcode (SYMBOLEXT) and a modification mass of float("nan").

  • verbose (0, 1, or 2, default = 1) –

    • 0: All warnings are ignored.

    • 1: Warnings are printed to stdout.

    • 2: Warnings are treated as errors.

Returns:

The parser_result object containing all parsed information.

Return type:

dict

Raises:
  • RuntimeError – If the file(s) contain no crosslinks or crosslink-spectrum-matches.

  • TypeError – If parameter verbose was not set correctly.

Examples

>>> from pyXLMS.parser import read_xi
>>> csms_from_xiSearch = read_xi("data/xi/r1_Xi1.7.6.7.csv")
>>> from pyXLMS.parser import read_xi
>>> csms_from_xiFDR = read_xi("data/xi/1perc_xl_boost_CSM_xiFDR2.2.1.csv")
>>> from pyXLMS.parser import read_xi
>>> crosslinks_from_xiFDR = read_xi("data/xi/1perc_xl_boost_Links_xiFDR2.2.1.csv")
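
The role of the ‘modifications’ mapping and of ignore_errors can be illustrated with a minimal, self-contained sketch. It does not reproduce the parser's actual implementation: the helper name sketch_parse_xi_sequence, the regular expression, the 1-based positions and the KeyError are assumptions made for illustration, and only alphanumeric shortcodes such as "cm" or "bs3_hyd" are handled.

```python
import re
from typing import Dict, Tuple

# Subset of the default mapping shown in the read_xi signature.
XI_MODS: Dict[str, Tuple[str, float]] = {
    "cm": ("Carbamidomethyl", 57.021464),
    "ox": ("Oxidation", 15.994915),
}


def sketch_parse_xi_sequence(sequence, mapping=XI_MODS, ignore_errors=False):
    """Hypothetical helper: split a xi sequence into residues and modifications."""
    residues = []
    modifications = {}
    # Assumption: an uppercase residue is optionally followed by a lowercase shortcode.
    matches = re.finditer(r"([A-Z])([a-z0-9_]*)", sequence)
    for position, match in enumerate(matches, start=1):
        residue, shortcode = match.groups()
        residues.append(residue)
        if not shortcode:
            continue
        if shortcode in mapping:
            modifications[position] = mapping[shortcode]
        elif ignore_errors:
            # Unknown shortcode: keep the shortcode as name and use NaN as mass.
            modifications[position] = (shortcode, float("nan"))
        else:
            raise KeyError(f"Unknown xi modification: {shortcode}")
    return "".join(residues), modifications


plain, mods = sketch_parse_xi_sequence("CcmCcmPSR")
print(plain, mods)
```

With ignore_errors=True an unknown shortcode is kept under its own name with a NaN mass, which mirrors the behaviour described for the parameter above.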

pyXLMS.parser_xldbse_xlinkx module#

pyXLMS.parser_xldbse_xlinkx.read_xlinkx(
files: str | List[str] | BinaryIO,
decoy: bool | None = None,
parse_modifications: bool = True,
modifications: Dict[str, float] = {'ADH': 138.09054635, 'Acetyl': 42.010565, 'BS3': 138.06808, 'Carbamidomethyl': 57.021464, 'DSBSO': 308.03883, 'DSBU': 196.08479231, 'DSS': 138.06808, 'DSSO': 158.00376, 'Oxidation': 15.994915, 'PhoX': 209.97181, 'Phospho': 79.966331},
format: Literal['auto', 'csv', 'txt', 'tsv', 'xlsx', 'pdresult'] = 'auto',
sep: str = '\t',
decimal: str = '.',
ignore_errors: bool = False,
verbose: Literal[0, 1, 2] = 1,
) Dict[str, Any][source]#

Read an XlinkX result file.

Reads an XlinkX crosslink-spectrum-matches result file or crosslink result file in .csv or .xlsx format, or both from a .pdResult file from Proteome Discoverer, and returns a parser_result.

Parameters:
  • files (str, list of str, or file stream) – The name/path of the XlinkX result file(s) or a file-like object/stream.

  • decoy (bool, or None) – Default decoy value to use if no decoy value is found. Only used if the “Is Decoy” column is not found in the supplied data.

  • parse_modifications (bool, default = True) – Whether or not post-translational-modifications should be parsed for crosslink-spectrum-matches. Requires correct specification of the ‘modifications’ parameter.

  • modifications (dict of str, float, default = constants.MODIFICATIONS) – Mapping of modification names to modification masses.

  • format ("auto", "csv", "tsv", "txt", "xlsx", or "pdresult", default = "auto") – The format of the result file. "auto" is only available if the name/path to the XlinkX result file is given.

  • sep (str, default = "\t") – Separator used in the .csv or .tsv file. Parameter is ignored if the file is in .xlsx or .pdResult format.

  • decimal (str, default = ".") – Character to recognize as decimal point. Parameter is ignored if the file is in .xlsx or .pdResult format.

  • ignore_errors (bool, default = False) – Whether missing crosslink positions should be tolerated instead of raising an error. Setting this to True suppresses the RuntimeError that is raised when the crosslink position cannot be parsed for at least one of the crosslinks. In these cases the crosslink position is set to 100 000.

  • verbose (0, 1, or 2, default = 1) –

    • 0: All warnings are ignored.

    • 1: Warnings are printed to stdout.

    • 2: Warnings are treated as errors.

Returns:

The parser_result object containing all parsed information.

Return type:

dict

Raises:
  • ValueError – If the input format is not supported or cannot be inferred.

  • TypeError – If parameter verbose was not set correctly.

  • TypeError – If the pdResult file is provided in the wrong format.

  • RuntimeError – If the crosslink position could not be parsed for at least one of the crosslinks.

  • RuntimeError – If the file(s) could not be read or if the file(s) contain no crosslinks or crosslink-spectrum-matches.

  • KeyError – If one of the found post-translational-modifications could not be found/mapped.

Warning

XlinkX does not report if the individual peptides in a crosslink are from the target or decoy database. The parser assumes that both peptides from a target crosslink are from the target database, and vice versa, that both peptides are from the decoy database if it is a decoy crosslink. This leads to only TT and DD matches, which needs to be considered for FDR estimation. This applies to both crosslinks and crosslink-spectrum-matches.

Examples

>>> from pyXLMS.parser import read_xlinkx
>>> csms_from_xlsx = read_xlinkx("data/xlinkx/XLpeplib_Beveridge_Lumos_DSSO_MS3_CSMs.xlsx")
>>> from pyXLMS.parser import read_xlinkx
>>> crosslinks_from_xlsx = read_xlinkx("data/xlinkx/XLpeplib_Beveridge_Lumos_DSSO_MS3_Crosslinks.xlsx")
>>> from pyXLMS.parser import read_xlinkx
>>> csms_from_tsv = read_xlinkx("data/xlinkx/XLpeplib_Beveridge_Lumos_DSSO_MS3_CSMs.txt")
>>> from pyXLMS.parser import read_xlinkx
>>> crosslinks_from_tsv = read_xlinkx("data/xlinkx/XLpeplib_Beveridge_Lumos_DSSO_MS3_Crosslinks.txt")
>>> from pyXLMS.parser import read_xlinkx
>>> csms_and_crosslinks_from_pdresult = read_xlinkx("data/xlinkx/XLpeplib_Beveridge_Lumos_DSSO_MS3.pdResult")
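
How format="auto" might be resolved from a file name can be sketched as follows. This is an illustration under stated assumptions, not the parser's code: the helper sketch_infer_format and its use of the file extension are hypothetical, and only the list of supported formats and the ValueError are taken from the documentation above.

```python
from pathlib import Path

# Formats listed in the read_xlinkx signature (without "auto").
SUPPORTED_FORMATS = {"csv", "txt", "tsv", "xlsx", "pdresult"}


def sketch_infer_format(file, format="auto"):
    """Hypothetical helper: resolve the result file format."""
    if format != "auto":
        if format not in SUPPORTED_FORMATS:
            raise ValueError(f"Unsupported format: {format}")
        return format
    # "auto" needs a name/path; a stream carries no extension to inspect.
    if not isinstance(file, str):
        raise ValueError("format='auto' requires a file name or path")
    suffix = Path(file).suffix.lstrip(".").lower()
    if suffix not in SUPPORTED_FORMATS:
        raise ValueError(f"Cannot infer format from: {file}")
    return suffix


print(sketch_infer_format("data/xlinkx/XLpeplib_Beveridge_Lumos_DSSO_MS3.pdResult"))
```

When a file-like object is passed, the format has to be given explicitly, for example format="xlsx".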

pyXLMS.pipelines module#

pyXLMS.pipelines.pipeline(
files: str | List[str] | BinaryIO,
engine: Literal['Custom', 'MaxQuant', 'MaxLynx', 'MS Annika', 'mzIdentML', 'pLink', 'Scout', 'xiSearch/xiFDR', 'XlinkX'],
crosslinker: str,
unique: bool | Dict[str, Any] | None = True,
validate: bool | Dict[str, Any] | None = True,
targets_only: bool | None = True,
**kwargs,
) Dict[str, Any][source]#

Runs a standard down-stream analysis pipeline for crosslinks and crosslink-spectrum-matches.

Runs a standard down-stream analysis pipeline for crosslinks and crosslink-spectrum-matches. The pipeline first reads a result file, then optionally filters the read data for unique crosslinks and crosslink-spectrum-matches, optionally validates the data by false discovery rate estimation, and optionally returns only target-target matches. Internally the pipeline calls parser.read(), transform.unique(), transform.validate(), and transform.targets_only().

Parameters:
  • files (str, list of str, or file stream) – The name/path of the result file(s) or a file-like object/stream.

  • engine ("Custom", "MaxQuant", "MaxLynx", "MS Annika", "mzIdentML", "pLink", "Scout", "xiSearch/xiFDR", or "XlinkX") – Crosslink search engine or format of the result file.

  • crosslinker (str) – Name of the used cross-linking reagent, for example “DSSO”.

  • unique (dict of str, any, or bool, or None, default = True) – If transform.unique() should be run in the pipeline. If None or False this step is omitted. If True this step is run with default parameters. If a dictionary is given it should contain parameters for running transform.unique(). Omitting a parameter in the dictionary will fall back to its default value.

  • validate (dict of str, any, or bool, or None, default = True) – If transform.validate() should be run in the pipeline. If None or False this step is omitted. If True this step is run with default parameters. If a dictionary is given it should contain parameters for running transform.validate(). Omitting a parameter in the dictionary will fall back to its default value.

  • targets_only (bool, or None, default = True) – If transform.targets_only() should be run in the pipeline. If None or False this step is omitted.

  • **kwargs – Any additional parameters will be passed to the specific result file parsers.

Returns:

The transformed parser_result after all pipeline steps are completed.

Return type:

dict of str, any

Raises:

TypeError – If any of the parameters do not have the correct type.

Notes

Various helpful pipeline information is also printed to stdout.

Examples

>>> from pyXLMS.pipelines import pipeline
>>> pr = pipeline("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_CSMs.xlsx",
...               engine="MS Annika",
...               crosslinker="DSS",
...               unique=True,
...               validate={"fdr": 0.05, "formula":"(TD-DD)/TT"},
...               targets_only=True)
Reading MS Annika CSMs...: 100%|██████████████████████████████████████████████████| 826/826 [00:00<00:00, 10337.98it/s]
---- Summary statistics before pipeline ----
Number of CSMs: 826.0
Number of unique CSMs: 826.0
Number of intra CSMs: 803.0
Number of inter CSMs: 23.0
Number of target-target CSMs: 786.0
Number of target-decoy CSMs: 39.0
Number of decoy-decoy CSMs: 1.0
Minimum CSM score: 1.11
Maximum CSM score: 452.99
Iterating over scores for FDR calculation...:   0%|                                            | 0/826 [00:00<?, ?it/s]
---- Summary statistics after pipeline ----
Number of CSMs: 786.0
Number of unique CSMs: 786.0
Number of intra CSMs: 774.0
Number of inter CSMs: 12.0
Number of target-target CSMs: 786.0
Number of target-decoy CSMs: 0.0
Number of decoy-decoy CSMs: 0.0
Minimum CSM score: 1.28
Maximum CSM score: 452.99
---- Performed pipeline steps ----
:: parser.read() ::
:: parser.read() :: params :: <params omitted>
:: transform.unique() ::
:: transform.unique() :: params :: by=peptide
:: transform.unique() :: params :: score=higher_better
:: transform.validate() ::
:: transform.validate() :: params :: fdr=0.05
:: transform.validate() :: params :: formula=(TD-DD)/TT
:: transform.validate() :: params :: score=higher_better
:: transform.validate() :: params :: separate_intra_inter=False
:: transform.validate() :: params :: ignore_missing_labels=False
:: transform.targets_only() ::
:: transform.targets_only() :: params :: no params
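
The way the unique and validate options are interpreted (None or False skips the step, True uses defaults, a dictionary overrides individual defaults) can be sketched in a few lines. The helper sketch_resolve_step is hypothetical and not part of pyXLMS. The defaults used here are only those that appear in the log above without having been passed explicitly; any other defaults of transform.validate() are not shown there and are left out.

```python
from typing import Any, Dict, Optional, Union

StepOption = Union[bool, Dict[str, Any], None]


def sketch_resolve_step(option: StepOption, defaults: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Hypothetical helper: return the parameters for a step, or None to skip it."""
    if option is None or option is False:
        return None  # step is omitted
    if option is True:
        return dict(defaults)  # step runs with default parameters
    if isinstance(option, dict):
        return {**defaults, **option}  # omitted keys fall back to defaults
    raise TypeError("option must be a bool, a dict, or None")


# Defaults visible in the example log for transform.validate().
VALIDATE_DEFAULTS = {
    "score": "higher_better",
    "separate_intra_inter": False,
    "ignore_missing_labels": False,
}

params = sketch_resolve_step({"fdr": 0.05, "formula": "(TD-DD)/TT"}, VALIDATE_DEFAULTS)
print(params)
```

The merged dictionary contains the same five parameters that the pipeline reports for transform.validate() in the example.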

pyXLMS.transform module#

pyXLMS.transform_aggregate module#

pyXLMS.transform_aggregate.aggregate(
csms: List[Dict[str, Any]],
by: Literal['peptide', 'protein'] = 'peptide',
score: Literal['higher_better', 'lower_better'] = 'higher_better',
) List[Dict[str, Any]][source]#

Aggregate crosslink-spectrum-matches to crosslinks.

Aggregates a list of crosslink-spectrum-matches to unique crosslinks. A crosslink is considered unique if there is no other crosslink with the same peptide sequence and crosslink position if by = "peptide", otherwise it is considered unique if there are no other crosslinks with the same protein crosslink position (residue pair). If more than one crosslink exists per peptide sequence/residue pair, the one with the best score is kept and the rest are filtered out. If crosslink-spectrum-matches without scores are provided, the crosslink of the first corresponding crosslink-spectrum-match in the list is kept instead.

Parameters:
  • csms (list of dict of str, any) – A list of crosslink-spectrum-matches.

  • by (str, one of "peptide" or "protein", default = "peptide") – If peptide or protein crosslink position should be used for determining if a crosslink is unique. If protein crosslink position is not available for all crosslink-spectrum-matches a ValueError will be raised. Make sure that all crosslink-spectrum-matches have the _proteins and _proteins_crosslink_positions fields set. If this is not already done by the parser, this can be achieved with transform.reannotate_positions().

  • score (str, one of "higher_better" or "lower_better", default = "higher_better") – If a higher score is considered better, or a lower score is considered better.

Returns:

A list of aggregated, unique crosslinks.

Return type:

list of dict of str, any

Warning

Aggregation will not conserve false discovery rate (FDR)! Aggregating crosslink-spectrum-matches that are validated for 1% FDR will not result in crosslinks validated for 1% FDR! Aggregated crosslinks should be validated with either external tools or with the built-in transform.validate()!

Raises:
  • TypeError – If a wrong data type is provided.

  • TypeError – If parameter by is not one of ‘peptide’ or ‘protein’.

  • TypeError – If parameter score is not one of ‘higher_better’ or ‘lower_better’.

  • ValueError – If parameter by is set to ‘protein’ but protein crosslink positions are not available.

Examples

>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import aggregate
>>> pr = read("data/_test/aggregate/csms.txt", engine="custom", crosslinker="DSS")
>>> len(pr["crosslink-spectrum-matches"])
10
>>> aggregate_peptide = aggregate(pr["crosslink-spectrum-matches"], by="peptide")
>>> len(aggregate_peptide)
3
>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import aggregate
>>> pr = read("data/_test/aggregate/csms.txt", engine="custom", crosslinker="DSS")
>>> len(pr["crosslink-spectrum-matches"])
10
>>> aggregate_protein = aggregate(pr["crosslink-spectrum-matches"], by="protein")
>>> len(aggregate_protein)
2
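
The grouping rule for by = "peptide" can be sketched without pyXLMS. The dictionaries below are simplified stand-ins for crosslink-spectrum-matches; the key names and the helper sketch_aggregate are assumptions for illustration, not the library's data model or implementation.

```python
def sketch_aggregate(csms, score="higher_better"):
    """Hypothetical helper: keep the best-scoring entry per peptide pair and position."""
    best = {}
    for csm in csms:
        # Assumed key: both peptide sequences plus their crosslink positions.
        key = (
            csm["alpha_peptide"], csm["alpha_position"],
            csm["beta_peptide"], csm["beta_position"],
        )
        current = best.get(key)
        if current is None:
            best[key] = csm
            continue
        if csm["score"] is None or current["score"] is None:
            continue  # without scores the first entry is kept
        if score == "higher_better":
            is_better = csm["score"] > current["score"]
        else:
            is_better = csm["score"] < current["score"]
        if is_better:
            best[key] = csm
    return list(best.values())


example_csms = [
    {"alpha_peptide": "PEPKTIDE", "alpha_position": 4,
     "beta_peptide": "KPEPTIDE", "beta_position": 1, "score": 10.0},
    {"alpha_peptide": "PEPKTIDE", "alpha_position": 4,
     "beta_peptide": "KPEPTIDE", "beta_position": 1, "score": 25.0},
    {"alpha_peptide": "PEPKTIDE", "alpha_position": 4,
     "beta_peptide": "KAAAR", "beta_position": 1, "score": 5.0},
]
aggregated = sketch_aggregate(example_csms)
print(len(aggregated))
```
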
pyXLMS.transform_aggregate.unique(
data: List[Dict[str, Any]] | Dict[str, Any],
by: Literal['peptide', 'protein'] = 'peptide',
score: Literal['higher_better', 'lower_better'] = 'higher_better',
) List[Dict[str, Any]] | Dict[str, Any][source]#

Filter for unique crosslinks or crosslink-spectrum-matches.

Filters for unique crosslinks from a list of non-unique crosslinks. A crosslink is considered unique if there is no other crosslink with the same peptide sequence and crosslink position if by = "peptide", otherwise it is considered unique if there are no other crosslinks with the same protein crosslink position (residue pair). If more than one crosslink exists per peptide sequence/residue pair, the one with the best score is kept and the rest are filtered out. If crosslinks without scores are provided, the first crosslink in the list is kept instead.

or

Filters for unique crosslink-spectrum-matches from a list of non-unique crosslink-spectrum-matches. A crosslink-spectrum-match is considered unique if there is no other crosslink-spectrum-match from the same spectrum file and with the same scan number. If more than one crosslink-spectrum-match exists per spectrum file and scan number, the one with the best score is kept and the rest are filtered out. If crosslink-spectrum-matches without scores are provided, the first crosslink-spectrum-match in the list is kept instead.

Parameters:
  • data (dict of str, any, or list of dict of str, any) – A list of crosslink-spectrum-matches or crosslinks to filter, or a parser_result.

  • by (str, one of "peptide" or "protein", default = "peptide") – If peptide or protein crosslink position should be used for determining if a crosslink is unique. Only affects filtering for unique crosslinks and not crosslink-spectrum-matches. If protein crosslink position is not available for all crosslinks a ValueError will be raised. Make sure that all crosslinks have the _proteins and _proteins_crosslink_positions fields set. If this is not already done by the parser, this can be achieved with transform.reannotate_positions().

  • score (str, one of "higher_better" or "lower_better", default = "higher_better") – If a higher score is considered better, or a lower score is considered better.

Returns:

If a list of crosslink-spectrum-matches or crosslinks was provided, a list of unique crosslink-spectrum-matches or crosslinks is returned. If a parser_result was provided, a parser_result with unique crosslink-spectrum-matches and/or unique crosslinks will be returned.

Return type:

list of dict of str, any, or dict of str, any

Raises:
  • TypeError – If a wrong data type is provided.

  • TypeError – If parameter by is not one of ‘peptide’ or ‘protein’.

  • TypeError – If parameter score is not one of ‘higher_better’ or ‘lower_better’.

  • ValueError – If parameter by is set to ‘protein’ but protein crosslink positions are not available.

Examples

>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import unique
>>> pr = read(["data/_test/aggregate/csms.txt", "data/_test/aggregate/xls.txt"], engine="custom", crosslinker="DSS")
>>> len(pr["crosslink-spectrum-matches"])
10
>>> len(pr["crosslinks"])
10
>>> unique_peptide = unique(pr, by="peptide")
>>> len(unique_peptide["crosslink-spectrum-matches"])
5
>>> len(unique_peptide["crosslinks"])
3
>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import unique
>>> pr = read(["data/_test/aggregate/csms.txt", "data/_test/aggregate/xls.txt"], engine="custom", crosslinker="DSS")
>>> len(pr["crosslink-spectrum-matches"])
10
>>> len(pr["crosslinks"])
10
>>> unique_protein = unique(pr, by="protein")
>>> len(unique_protein["crosslink-spectrum-matches"])
5
>>> len(unique_protein["crosslinks"])
2
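
For crosslink-spectrum-matches the uniqueness key is the spectrum file together with the scan number. A minimal sketch of that rule, using simplified dictionaries whose key names (spectrum_file, scan_nr, score) are assumed for illustration and a hypothetical helper sketch_unique_csms:

```python
def sketch_unique_csms(csms, score="higher_better"):
    """Hypothetical helper: one match per (spectrum file, scan number)."""
    # Flip the sign so that "larger is better" holds in both modes.
    sign = 1 if score == "higher_better" else -1
    kept = {}
    for csm in csms:
        key = (csm["spectrum_file"], csm["scan_nr"])
        if key not in kept:
            kept[key] = csm
        elif csm["score"] is not None and kept[key]["score"] is not None:
            if sign * csm["score"] > sign * kept[key]["score"]:
                kept[key] = csm
        # If either score is missing, the first match stays.
    return list(kept.values())


example_csms = [
    {"spectrum_file": "RUN_1", "scan_nr": 1, "score": 3.0},
    {"spectrum_file": "RUN_1", "scan_nr": 1, "score": 7.5},
    {"spectrum_file": "RUN_1", "scan_nr": 2, "score": 4.2},
    {"spectrum_file": "RUN_2", "scan_nr": 1, "score": None},
]
unique_csms = sketch_unique_csms(example_csms)
print(len(unique_csms))
```
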

pyXLMS.transform_filter module#

Separate crosslinks and crosslink-spectrum-matches by their crosslink type.

Gets all crosslinks or crosslink-spectrum-matches depending on crosslink type. Will separate based on if a crosslink or crosslink-spectrum-match is of type “intra” or “inter” crosslink.

Parameters:

data (list of dict of str, any) – A list of pyXLMS crosslinks or crosslink-spectrum-matches.

Returns:

Returns a dictionary with key Intra which contains all crosslinks or crosslink-spectrum-matches with crosslink type = “intra”, and key Inter which contains all crosslinks or crosslink-spectrum-matches with crosslink type = “inter”.

Return type:

dict of str, list of dict

Raises:

TypeError – If an unsupported data type is provided.

Examples

>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import filter_crosslink_type
>>> result = read("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_CSMs.xlsx", engine="MS Annika", crosslinker="DSS")
>>> crosslink_type_filtered_csms = filter_crosslink_type(result["crosslink-spectrum-matches"])
>>> len(crosslink_type_filtered_csms["Intra"])
803
>>> len(crosslink_type_filtered_csms["Inter"])
23
>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import filter_crosslink_type
>>> result = read("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_Crosslinks.xlsx", engine="MS Annika", crosslinker="DSS")
>>> crosslink_type_filtered_crosslinks = filter_crosslink_type(result["crosslinks"])
>>> len(crosslink_type_filtered_crosslinks["Intra"])
279
>>> len(crosslink_type_filtered_crosslinks["Inter"])
21
pyXLMS.transform_filter.filter_proteins(
data: List[Dict[str, Any]],
proteins: Set[str] | List[str],
) Dict[str, List[Any]][source]#

Get all crosslinks or crosslink-spectrum-matches originating from proteins of interest.

Gets all crosslinks or crosslink-spectrum-matches originating from a list of proteins of interest and returns a list of crosslinks or crosslink-spectrum-matches where both peptides come from a protein of interest and a list of crosslinks or crosslink-spectrum-matches where one of the peptides comes from a protein of interest.

Parameters:
  • data (list of dict of str, any) – A list of pyXLMS crosslinks or crosslink-spectrum-matches.

  • proteins (set of str, or list of str) – A set of protein accessions of interest.

Returns:

Returns a dictionary with key Proteins which contains the list of proteins of interest, key Both which contains all crosslinks or crosslink-spectrum-matches where both peptides originate from a protein of interest, and key One which contains all crosslinks or crosslink-spectrum-matches where exactly one of the two peptides originates from a protein of interest.

Return type:

dict

Raises:

TypeError – If an unsupported data type is provided.

Examples

>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import filter_proteins
>>> result = read("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_CSMs.xlsx", engine="MS Annika", crosslinker="DSS")
>>> proteins_csms = filter_proteins(result["crosslink-spectrum-matches"], ["Cas9"])
>>> proteins_csms["Proteins"]
['Cas9']
>>> len(proteins_csms["Both"])
798
>>> len(proteins_csms["One"])
23
>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import filter_proteins
>>> result = read("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_Crosslinks.xlsx", engine="MS Annika", crosslinker="DSS")
>>> proteins_xls = filter_proteins(result["crosslinks"], ["Cas9"])
>>> proteins_xls["Proteins"]
['Cas9']
>>> len(proteins_xls["Both"])
274
>>> len(proteins_xls["One"])
21
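
The split into Both and One can be sketched as follows. The keys alpha_proteins and beta_proteins are the fields used elsewhere in this reference; the helper sketch_filter_proteins and the toy data are hypothetical, and the real function may differ in detail.

```python
def sketch_filter_proteins(data, proteins):
    """Hypothetical helper: split matches by how many peptides hit a protein of interest."""
    wanted = set(proteins)
    result = {"Proteins": list(proteins), "Both": [], "One": []}
    for item in data:
        alpha_hit = bool(wanted & set(item["alpha_proteins"]))
        beta_hit = bool(wanted & set(item["beta_proteins"]))
        if alpha_hit and beta_hit:
            result["Both"].append(item)
        elif alpha_hit or beta_hit:
            result["One"].append(item)
        # Matches without any protein of interest are dropped.
    return result


example_data = [
    {"alpha_proteins": ["Cas9"], "beta_proteins": ["Cas9"]},
    {"alpha_proteins": ["Cas9"], "beta_proteins": ["P1"]},
    {"alpha_proteins": ["P1"], "beta_proteins": ["P2"]},
]
filtered = sketch_filter_proteins(example_data, ["Cas9"])
print(len(filtered["Both"]), len(filtered["One"]))
```
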
pyXLMS.transform_filter.filter_target_decoy(
data: List[Dict[str, Any]],
) Dict[str, List[Dict[str, Any]]][source]#

Separate crosslinks or crosslink-spectrum-matches based on target and decoy matches.

Separates crosslinks or crosslink-spectrum-matches based on whether both peptides match to the target database, both match to the decoy database, or one of them matches to the target database and the other to the decoy database. The first we denote as “Target-Target” or “TT” matches, the second as “Decoy-Decoy” or “DD” matches, and the third as “Target-Decoy” or “TD” matches.

Parameters:

data (list of dict of str, any) – A list of pyXLMS crosslinks or crosslink-spectrum-matches.

Returns:

Returns a dictionary with key Target-Target which contains all TT matches, key Target-Decoy which contains all TD matches, and key Decoy-Decoy which contains all DD matches.

Return type:

dict

Raises:

TypeError – If an unsupported data type is provided.

Examples

>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import filter_target_decoy
>>> result = read("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_CSMs.xlsx", engine="MS Annika", crosslinker="DSS")
>>> target_and_decoys = filter_target_decoy(result["crosslink-spectrum-matches"])
>>> len(target_and_decoys["Target-Target"])
786
>>> len(target_and_decoys["Target-Decoy"])
39
>>> len(target_and_decoys["Decoy-Decoy"])
1
>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import filter_target_decoy
>>> result = read("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_Crosslinks.xlsx", engine="MS Annika", crosslinker="DSS")
>>> target_and_decoys = filter_target_decoy(result["crosslinks"])
>>> len(target_and_decoys["Target-Target"])
265
>>> len(target_and_decoys["Target-Decoy"])
0
>>> len(target_and_decoys["Decoy-Decoy"])
35
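
The TT/TD/DD classification can be sketched with two boolean flags per match. The field names alpha_decoy and beta_decoy and the helper sketch_filter_target_decoy are assumptions for illustration; only the returned dictionary keys are taken from the description above.

```python
def sketch_filter_target_decoy(data):
    """Hypothetical helper: group matches into TT, TD and DD."""
    result = {"Target-Target": [], "Target-Decoy": [], "Decoy-Decoy": []}
    for item in data:
        # Count how many of the two peptides come from the decoy database.
        decoys = int(bool(item["alpha_decoy"])) + int(bool(item["beta_decoy"]))
        if decoys == 0:
            result["Target-Target"].append(item)
        elif decoys == 1:
            result["Target-Decoy"].append(item)
        else:
            result["Decoy-Decoy"].append(item)
    return result


example_data = [
    {"alpha_decoy": False, "beta_decoy": False},
    {"alpha_decoy": True, "beta_decoy": False},
    {"alpha_decoy": False, "beta_decoy": True},
    {"alpha_decoy": True, "beta_decoy": True},
]
groups = sketch_filter_target_decoy(example_data)
print({name: len(items) for name, items in groups.items()})
```
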

pyXLMS.transform_reannotate_positions module#

pyXLMS.transform_reannotate_positions.fasta_title_to_accession(title: str) str[source]#

Parses the protein accession from a UniProt-like title.

Parameters:

title (str) – Fasta title/header.

Returns:

The protein accession parsed from the title. If parsing was unsuccessful the full title is returned.

Return type:

str

Examples

>>> from pyXLMS.transform import fasta_title_to_accession
>>> title = "sp|A0A087X1C5|CP2D7_HUMAN Putative cytochrome P450 2D7 OS=Homo sapiens OX=9606 GN=CYP2D7 PE=5 SV=1"
>>> fasta_title_to_accession(title)
'A0A087X1C5'
>>> from pyXLMS.transform import fasta_title_to_accession
>>> title = "Cas9"
>>> fasta_title_to_accession(title)
'Cas9'
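
A minimal sketch of this kind of title parsing, assuming the accession is the second field of a pipe-separated UniProt-like title and that the full title is returned otherwise. The helper sketch_title_to_accession is hypothetical and the actual implementation may use a different rule.

```python
def sketch_title_to_accession(title: str) -> str:
    """Hypothetical helper: extract the accession from 'db|ACCESSION|NAME ...'."""
    parts = title.split("|")
    if len(parts) >= 3 and parts[1].strip():
        return parts[1].strip()
    # Parsing was unsuccessful: fall back to the full title.
    return title


uniprot_title = "sp|A0A087X1C5|CP2D7_HUMAN Putative cytochrome P450 2D7 OS=Homo sapiens OX=9606 GN=CYP2D7 PE=5 SV=1"
print(sketch_title_to_accession(uniprot_title))
print(sketch_title_to_accession("Cas9"))
```

A function with this shape (str in, str out) is also what the title_to_accession parameter of reannotate_positions() expects.
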
pyXLMS.transform_reannotate_positions.reannotate_positions(
data: List[Dict[str, Any]] | Dict[str, Any],
fasta: str | BinaryIO,
title_to_accession: Callable[[str], str] | None = None,
) List[Dict[str, Any]] | Dict[str, Any][source]#

Reannotates protein crosslink positions for a given fasta file.

Reannotates the crosslink and peptide positions of the given cross-linked peptide pairs based on the specified fasta file. Takes a list of crosslink-spectrum-matches or crosslinks, or a parser_result as input.

Parameters:
  • data (list of dict of str, any, or dict of str, any) – A list of crosslink-spectrum-matches or crosslinks to annotate, or a parser_result.

  • fasta (str, or file stream) – The name/path of the fasta file containing protein sequences or a file-like object/stream.

  • title_to_accession (callable, or None, default = None) – A function that parses the protein accession from the fasta title/header. If None (default) the function fasta_title_to_accession is used.

Returns:

If a list of crosslink-spectrum-matches or crosslinks was provided, a list of annotated crosslink-spectrum-matches or crosslinks is returned. If a parser_result was provided, an annotated parser_result will be returned.

Return type:

list of dict of str, any, or dict of str, any

Raises:

TypeError – If a wrong data type is provided.

Examples

>>> from pyXLMS.data import create_crosslink_min
>>> from pyXLMS.transform import reannotate_positions
>>> xls = [create_crosslink_min("ADANLDK", 7, "GNTDRHSIK", 9)]
>>> xls = reannotate_positions(xls, "data/_fasta/Cas9_plus10.fasta")
>>> xls[0]["alpha_proteins"]
['Cas9']
>>> xls[0]["alpha_proteins_crosslink_positions"]
[1293]
>>> xls[0]["beta_proteins"]
['Cas9']
>>> xls[0]["beta_proteins_crosslink_positions"]
[48]
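
How a protein crosslink position follows from a peptide match can be sketched with plain string search. This is an illustration under assumptions (1-based positions, exact substring matching, one position per occurrence); the helper sketch_protein_positions and the toy sequence are hypothetical and do not reproduce the library's implementation.

```python
def sketch_protein_positions(peptide, crosslink_position, protein_sequence):
    """Hypothetical helper: 1-based protein positions of the crosslinked residue."""
    positions = []
    start = protein_sequence.find(peptide)
    while start != -1:
        # start is 0-based and crosslink_position is 1-based within the peptide,
        # so their sum is the 1-based position within the protein.
        positions.append(start + crosslink_position)
        start = protein_sequence.find(peptide, start + 1)
    return positions


toy_protein = "MAAADANLDKGG"
print(sketch_protein_positions("ADANLDK", 7, toy_protein))
```
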

pyXLMS.transform_summary module#

pyXLMS.transform_summary.summary(
data: List[Dict[str, Any]] | Dict[str, Any],
) Dict[str, float][source]#

Extracts summary stats from a list of crosslinks or crosslink-spectrum-matches, or a parser_result.

Extracts summary statistics from a list of crosslinks or crosslink-spectrum-matches, or a parser_result. The statistics depend on the supplied data type. If a list of crosslinks is supplied, a dictionary with the following statistics and keys is returned:

  • Number of crosslinks

  • Number of unique crosslinks by peptide

  • Number of unique crosslinks by protein

  • Number of intra crosslinks

  • Number of inter crosslinks

  • Number of target-target crosslinks

  • Number of target-decoy crosslinks

  • Number of decoy-decoy crosslinks

  • Minimum crosslink score

  • Maximum crosslink score

If a list of crosslink-spectrum-matches is supplied, a dictionary with the following statistics and keys is returned:

  • Number of CSMs

  • Number of unique CSMs

  • Number of intra CSMs

  • Number of inter CSMs

  • Number of target-target CSMs

  • Number of target-decoy CSMs

  • Number of decoy-decoy CSMs

  • Minimum CSM score

  • Maximum CSM score

If a parser_result is supplied, the returned dictionary contains whichever of these statistics are available. A parser_result that only contains crosslinks yields a dictionary with crosslink statistics only, and a parser_result that only contains crosslink-spectrum-matches yields a dictionary with crosslink-spectrum-match statistics only. If the parser_result contains both, the two dictionaries are merged and a single dictionary with the keys for both crosslinks and crosslink-spectrum-matches is returned.

Statistics are also printed to stdout.

Parameters:

data (list of dict of str, any, or dict of str, any) – A list of crosslinks or crosslink-spectrum-matches, or a parser_result.

Returns:

A dictionary with summary statistics.

Return type:

dict of str, float

Examples

>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import summary
>>> pr = read("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_CSMs.xlsx", engine="MS Annika", crosslinker="DSS")
>>> csms = pr["crosslink-spectrum-matches"]
>>> stats = summary(csms)
Number of CSMs: 826.0
Number of unique CSMs: 826.0
Number of intra CSMs: 803.0
Number of inter CSMs: 23.0
Number of target-target CSMs: 786.0
Number of target-decoy CSMs: 39.0
Number of decoy-decoy CSMs: 1.0
Minimum CSM score: 1.11
Maximum CSM score: 452.99
>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import summary
>>> pr = read("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_Crosslinks.xlsx", engine="MS Annika", crosslinker="DSS")
>>> stats = summary(pr)
Number of crosslinks: 300.0
Number of unique crosslinks by peptide: 300.0
Number of unique crosslinks by protein: 298.0
Number of intra crosslinks: 279.0
Number of inter crosslinks: 21.0
Number of target-target crosslinks: 265.0
Number of target-decoy crosslinks: 0.0
Number of decoy-decoy crosslinks: 35.0
Minimum crosslink score: 1.11
Maximum crosslink score: 452.99

pyXLMS.transform_targets_only module#

pyXLMS.transform_targets_only.targets_only(
data: List[Dict[str, Any]] | Dict[str, Any],
) List[Dict[str, Any]] | Dict[str, Any][source]#

Get target crosslinks or crosslink-spectrum-matches.

Get target crosslinks or crosslink-spectrum-matches from a list of target and decoy crosslinks or crosslink-spectrum-matches, or a parser_result. This effectively filters out any target-decoy and decoy-decoy matches and is essentially a convenience wrapper for transform.filter_target_decoy()["Target-Target"].

Parameters:

data (dict of str, any, or list of dict of str, any) – A list of crosslink-spectrum-matches or crosslinks, or a parser_result.

Returns:

If a list of crosslink-spectrum-matches or crosslinks was provided, a list of target crosslink-spectrum-matches or crosslinks is returned. If a parser_result was provided, a parser_result with target crosslink-spectrum-matches and/or target crosslinks will be returned.

Return type:

list of dict of str, any, or dict of str, any

Raises:
  • TypeError – If a wrong data type is provided.

  • RuntimeError – If no target crosslinks or crosslink-spectrum-matches were found.

Examples

>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import targets_only
>>> result = read("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_CSMs.xlsx", engine="MS Annika", crosslinker="DSS")
>>> targets = targets_only(result["crosslink-spectrum-matches"])
>>> len(targets)
786
>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import targets_only
>>> result = read("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_Crosslinks.xlsx", engine="MS Annika", crosslinker="DSS")
>>> targets = targets_only(result["crosslinks"])
>>> len(targets)
265
>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import targets_only
>>> result = read(["data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_CSMs.xlsx", "data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_Crosslinks.xlsx"], engine="MS Annika", crosslinker="DSS")
>>> result_targets = targets_only(result)
>>> len(result_targets["crosslink-spectrum-matches"])
786
>>> len(result_targets["crosslinks"])
265

pyXLMS.transform_to_dataframe module#

pyXLMS.transform_to_dataframe.to_dataframe(
data: List[Dict[str, Any]],
) DataFrame[source]#

Returns a pandas DataFrame of the given crosslinks or crosslink-spectrum-matches.

Parameters:

data (list) – A list of crosslinks or crosslink-spectrum-matches as created by data.create_crosslink() or data.create_csm().

Returns:

The pandas DataFrame created from the list of input crosslinks or crosslink-spectrum-matches. A full specification of the returned DataFrame can be found in the docs.

Return type:

pandas.DataFrame

Raises:
  • TypeError – If the list does not contain crosslinks or crosslink-spectrum-matches.

  • ValueError – If the list does not contain any objects.

Examples

>>> from pyXLMS.transform import to_dataframe
>>> # assume that crosslinks is a list of crosslinks created by data.create_crosslink()
>>> crosslink_dataframe = to_dataframe(crosslinks)
>>> # assume csms is a list of crosslink-spectrum-matches created by data.create_csm()
>>> csm_dataframe = to_dataframe(csms)

pyXLMS.transform_to_proforma module#

pyXLMS.transform_to_proforma.to_proforma(
data: Dict[str, Any] | List[Dict[str, Any]],
crosslinker: str | float | None = None,
) str | List[str][source]#

Returns the Proforma string for a single crosslink or crosslink-spectrum-match, or for a list of crosslinks or crosslink-spectrum-matches.

Parameters:
  • data (dict of str, any, or list of dict of str, any) – A pyXLMS crosslink object, e.g. see data.create_crosslink(). Or a pyXLMS crosslink-spectrum-match object, e.g. see data.create_csm(). Alternatively, a list of crosslinks or crosslink-spectrum-matches can also be provided.

  • crosslinker (str, or float, or None, default = None) – Optional name or mass of the crosslink reagent. If the name is given, it should be a valid name from XLMOD. If the crosslink modification is contained in the crosslink-spectrum-match object this parameter has no effect.

Returns:

The Proforma string of the crosslink or crosslink-spectrum-match. If a list was provided a list containing all Proforma strings is returned.

Return type:

str, or list of str

Raises:

TypeError – If an unsupported data type is provided.

Notes

  • Modifications with unknown mass are skipped.

  • If no modifications are given, only the crosslink modification will be encoded in the Proforma.

  • If no modifications are given and no crosslinker is given, the unmodified peptide Proforma will be returned.

Examples

>>> from pyXLMS.data import create_crosslink_min
>>> from pyXLMS.transform import to_proforma
>>> xl = create_crosslink_min("PEPKTIDE", 4, "KPEPTIDE", 1)
>>> to_proforma(xl)
'KPEPTIDE//PEPKTIDE'
>>> from pyXLMS.data import create_crosslink_min
>>> from pyXLMS.transform import to_proforma
>>> xl = create_crosslink_min("PEPKTIDE", 4, "KPEPTIDE", 1)
>>> to_proforma(xl, crosslinker="Xlink:DSSO")
'K[Xlink:DSSO]PEPTIDE//PEPK[Xlink:DSSO]TIDE'
>>> from pyXLMS.data import create_csm_min
>>> from pyXLMS.transform import to_proforma
>>> csm = create_csm_min("PEPKTIDE", 4, "KPEPTIDE", 1, "RUN_1", 1)
>>> to_proforma(csm)
'KPEPTIDE//PEPKTIDE'
>>> from pyXLMS.data import create_csm_min
>>> from pyXLMS.transform import to_proforma
>>> csm = create_csm_min("PEPKTIDE", 4, "KPEPTIDE", 1, "RUN_1", 1)
>>> to_proforma(csm, crosslinker="Xlink:DSSO")
'K[Xlink:DSSO]PEPTIDE//PEPK[Xlink:DSSO]TIDE'
>>> from pyXLMS.data import create_csm_min
>>> from pyXLMS.transform import to_proforma
>>> csm = create_csm_min("PEPKTIDE", 4, "KPMEPTIDE", 1, "RUN_1", 1, modifications_b={3:("Oxidation", 15.994915)})
>>> to_proforma(csm, crosslinker="Xlink:DSSO")
'K[Xlink:DSSO]PM[+15.994915]EPTIDE//PEPK[Xlink:DSSO]TIDE'
>>> from pyXLMS.data import create_csm_min
>>> from pyXLMS.transform import to_proforma
>>> csm = create_csm_min("PEPKTIDE", 4, "KPMEPTIDE", 1, "RUN_1", 1, modifications_b={3:("Oxidation", 15.994915)}, charge=3)
>>> to_proforma(csm, crosslinker="Xlink:DSSO")
'K[Xlink:DSSO]PM[+15.994915]EPTIDE//PEPK[Xlink:DSSO]TIDE/3'
>>> from pyXLMS.data import create_csm_min
>>> from pyXLMS.transform import to_proforma
>>> csm = create_csm_min("PEPKTIDE", 4, "KPMEPTIDE", 1, "RUN_1", 1, modifications_a={4:("DSSO", 158.00376)}, modifications_b={1:("DSSO", 158.00376), 3:("Oxidation", 15.994915)}, charge=3)
>>> to_proforma(csm)
'K[+158.00376]PM[+15.994915]EPTIDE//PEPK[+158.00376]TIDE/3'
>>> from pyXLMS.data import create_csm_min
>>> from pyXLMS.transform import to_proforma
>>> csm = create_csm_min("PEPKTIDE", 4, "KPMEPTIDE", 1, "RUN_1", 1, modifications_a={4:("DSSO", 158.00376)}, modifications_b={1:("DSSO", 158.00376), 3:("Oxidation", 15.994915)}, charge=3)
>>> to_proforma(csm, crosslinker="Xlink:DSSO")
'K[+158.00376]PM[+15.994915]EPTIDE//PEPK[+158.00376]TIDE/3'
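
The string layout seen in the examples above can be reproduced with a small stand-alone sketch. It is not the pyXLMS implementation: the helper names peptide_to_proforma and pair_to_proforma are hypothetical, only mass-delta modifications are handled, and the terminal handling assumes the position convention used elsewhere in this package (0 for the N-terminus, len(peptide) + 1 for the C-terminus).

```python
from typing import Dict, Optional, Tuple

# Maps 1-based peptide position to (modification name, delta mass).
Modifications = Dict[int, Tuple[str, float]]


def peptide_to_proforma(peptide: str, modifications: Optional[Modifications] = None) -> str:
    """Hypothetical helper: encode one peptide with mass-delta modifications."""
    modifications = modifications or {}
    parts = []
    # Assumption: position 0 is an N-terminal modification, written "[+mass]-PEPTIDE".
    if 0 in modifications:
        parts.append(f"[{modifications[0][1]:+}]-")
    for position, residue in enumerate(peptide, start=1):
        parts.append(residue)
        if position in modifications:
            # The "+" format flag always prints the sign of the delta mass.
            parts.append(f"[{modifications[position][1]:+}]")
    # Assumption: position len(peptide) + 1 is a C-terminal modification, written "PEPTIDE-[+mass]".
    if len(peptide) + 1 in modifications:
        parts.append(f"-[{modifications[len(peptide) + 1][1]:+}]")
    return "".join(parts)


def pair_to_proforma(first: str, second: str, charge: Optional[int] = None) -> str:
    """Hypothetical helper: join two encoded peptides with "//" and append "/charge" if given."""
    proforma = f"{first}//{second}"
    return proforma if charge is None else f"{proforma}/{charge}"


alpha = peptide_to_proforma("KPMEPTIDE", {1: ("DSSO", 158.00376), 3: ("Oxidation", 15.994915)})
beta = peptide_to_proforma("PEPKTIDE", {4: ("DSSO", 158.00376)})
print(pair_to_proforma(alpha, beta, charge=3))
# K[+158.00376]PM[+15.994915]EPTIDE//PEPK[+158.00376]TIDE/3
```

The sketch takes the peptides in the order in which they should appear in the output; it does not decide which peptide is alpha and which is beta.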

pyXLMS.transform_util module#

pyXLMS.transform_util.assert_data_type_same(data_list: List[Dict[str, Any]]) bool[source]#

Checks that all data is of the same data type.

Verifies that all elements in the provided list are of the same data type.

Parameters:

data_list (list of dict of str, any) – A list of dictionaries with the data_type key.

Returns:

True if all elements are of the same data type, otherwise False.

Return type:

bool

Examples

>>> from pyXLMS.transform import assert_data_type_same
>>> from pyXLMS import data
>>> data_list = [data.create_crosslink_min("PEPK", 4, "PKEP", 2), data.create_crosslink_min("KPEP", 1, "PEKP", 3)]
>>> assert_data_type_same(data_list)
True
>>> from pyXLMS.transform import assert_data_type_same
>>> from pyXLMS import data
>>> data_list = [data.create_crosslink_min("PEPK", 4, "PKEP", 2), data.create_csm_min("KPEP", 1, "PEKP", 3, "RUN_1", 1)]
>>> assert_data_type_same(data_list)
False
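
The check described above amounts to comparing the data_type key across all elements. The following is a minimal sketch using plain dictionaries instead of pyXLMS objects; the function name same_data_type, the data_type values shown, and the choice to return True for an empty list are assumptions made for illustration.

```python
from typing import Any, Dict, List


def same_data_type(data_list: List[Dict[str, Any]]) -> bool:
    """Hypothetical stand-in: True if every dict carries the same "data_type" value."""
    # A set collapses equal values, so more than one entry means mixed types.
    data_types = {item["data_type"] for item in data_list}
    return len(data_types) <= 1


# Hand-written stand-ins for crosslinks and crosslink-spectrum-matches.
crosslink = {"data_type": "crosslink", "alpha_peptide": "PEPK"}
csm = {"data_type": "crosslink-spectrum-match", "alpha_peptide": "KPEP"}

print(same_data_type([crosslink, crosslink]))  # True
print(same_data_type([crosslink, csm]))  # False
```
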
pyXLMS.transform_util.get_available_keys(
data_list: List[Dict[str, Any]],
) Dict[str, bool][source]#

Checks which data is available from a list of crosslinks or crosslink-spectrum-matches.

Verifies which data fields have been set for all crosslinks or crosslink-spectrum-matches in the given list. Returns a dictionary with the same structure as a crosslink or crosslink-spectrum-match, but each value is either True or False, depending on whether the field was set for every element.

Parameters:

data_list (list of dict of str, any) – A list of crosslinks or crosslink-spectrum-matches.

Returns:

  • If a list of crosslinks was provided, a dictionary with the following keys will be returned, where the value of each key denotes if the data field is available for all crosslinks in data_list. Keys: data_type, completeness, alpha_peptide, alpha_peptide_crosslink_position, alpha_proteins, alpha_proteins_crosslink_positions, alpha_decoy, beta_peptide, beta_peptide_crosslink_position, beta_proteins, beta_proteins_crosslink_positions, beta_decoy, crosslink_type, score, and additional_information.

  • If a list of crosslink-spectrum-matches was provided, a dictionary with the following keys will be returned, where the value of each key denotes if the data field is available for all crosslink-spectrum-matches in data_list. Keys: data_type, completeness, alpha_peptide, alpha_modifications, alpha_peptide_crosslink_position, alpha_proteins, alpha_proteins_crosslink_positions, alpha_proteins_peptide_positions, alpha_score, alpha_decoy, beta_peptide, beta_modifications, beta_peptide_crosslink_position, beta_proteins, beta_proteins_crosslink_positions, beta_proteins_peptide_positions, beta_score, beta_decoy, crosslink_type, score, spectrum_file, scan_nr, retention_time, ion_mobility, and additional_information.

Return type:

dict of str, bool

Raises:
  • TypeError – If not all elements in data_list are of the same data type.

  • TypeError – If one or more elements in the list are of an unsupported data type.

Examples

>>> from pyXLMS.transform import get_available_keys
>>> from pyXLMS import data
>>> data_list = [data.create_crosslink_min("PEPK", 4, "PKEP", 2), data.create_crosslink_min("KPEP", 1, "PEKP", 3)]
>>> available_keys = get_available_keys(data_list)
>>> available_keys["alpha_peptide"]
True
>>> available_keys["score"]
False
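
The idea of reporting a field as available only if it is set for every element can be sketched without pyXLMS. The sketch below assumes that an unset field is stored as None and that all dictionaries share the keys of the first one; both are assumptions, and available_keys_sketch is a hypothetical name.

```python
from typing import Any, Dict, List


def available_keys_sketch(data_list: List[Dict[str, Any]]) -> Dict[str, bool]:
    """Hypothetical stand-in: map each key to True if it is set for all elements."""
    if not data_list:
        return {}
    # Assumption: a field counts as "not set" when its value is None.
    return {
        key: all(item.get(key) is not None for item in data_list)
        for key in data_list[0]
    }


data_list = [
    {"alpha_peptide": "PEPK", "beta_peptide": "PKEP", "score": None},
    {"alpha_peptide": "KPEP", "beta_peptide": "PEKP", "score": 1.5},
]
available = available_keys_sketch(data_list)
print(available["alpha_peptide"])  # True
print(available["score"])  # False, because the first element has no score
```
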
pyXLMS.transform_util.modifications_to_str(
modifications: Dict[int, Tuple[str, float]] | None,
) str | None[source]#

Returns the string representation of a modifications dictionary.

Parameters:

modifications (dict of int, tuple, or None) – The modifications of a peptide given as a dictionary that maps peptide position (1-based) to modification given as a tuple of modification name and modification delta mass. N-terminal modifications should be denoted with position 0. C-terminal modifications should be denoted with position len(peptide) + 1.

Returns:

The string representation of the modifications (or None if no modification was provided).

Return type:

str, or None

Examples

>>> from pyXLMS.transform import modifications_to_str
>>> modifications_to_str({1: ("Oxidation", 15.994915), 5: ("Carbamidomethyl", 57.021464)})
'(1:[Oxidation|15.994915]);(5:[Carbamidomethyl|57.021464])'
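
The output format shown above can be reproduced with a few lines of plain Python. This is a sketch of the format only, not the library code: the name modifications_to_str_sketch is hypothetical, and both the ordering by position and the treatment of an empty dictionary as None are assumptions.

```python
from typing import Dict, Optional, Tuple


def modifications_to_str_sketch(
    modifications: Optional[Dict[int, Tuple[str, float]]],
) -> Optional[str]:
    """Hypothetical stand-in: render modifications as "(pos:[name|mass]);(pos:[name|mass])"."""
    # Assumption: None and an empty dictionary both mean "no modification".
    if not modifications:
        return None
    # Assumption: entries are emitted in ascending position order.
    return ";".join(
        f"({position}:[{name}|{mass}])"
        for position, (name, mass) in sorted(modifications.items())
    )


print(modifications_to_str_sketch({5: ("Carbamidomethyl", 57.021464), 1: ("Oxidation", 15.994915)}))
# (1:[Oxidation|15.994915]);(5:[Carbamidomethyl|57.021464])
```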

pyXLMS.transform_validate module#

pyXLMS.transform_validate.validate(
data: List[Dict[str, Any]] | Dict[str, Any],
fdr: float = 0.01,
formula: Literal['D/T', '(TD+DD)/TT', '(TD-DD)/TT'] = 'D/T',
score: Literal['higher_better', 'lower_better'] = 'higher_better',
separate_intra_inter: bool = False,
ignore_missing_labels: bool = False,
) List[Dict[str, Any]] | Dict[str, Any][source]#

Validate a list of crosslinks or crosslink-spectrum-matches, or a parser_result by estimating false discovery rate.

Validate a list of crosslinks or crosslink-spectrum-matches, or a parser_result by estimating false discovery rate (FDR) using the defined formula. Requires that “score”, “alpha_decoy” and “beta_decoy” fields are set for crosslinks and crosslink-spectrum-matches.

Parameters:
  • data (list of dict of str, any, or dict of str, any) – A list of crosslink-spectrum-matches or crosslinks to validate, or a parser_result.

  • fdr (float, default = 0.01) – The target FDR, must be given as a real number between 0 and 1. The default of 0.01 corresponds to 1% FDR.

  • formula (str, one of "D/T", "(TD+DD)/TT", or "(TD-DD)/TT", default = "D/T") – Which formula to use to estimate FDR. D and DD denote decoy matches, T and TT denote target matches, and TD denotes target-decoy and decoy-target matches.

  • score (str, one of "higher_better" or "lower_better", default = "higher_better") – Whether a higher score or a lower score is considered better.

  • separate_intra_inter (bool, default = False) – If FDR should be estimated separately for intra and inter matches.

  • ignore_missing_labels (bool, default = False) – If crosslinks and crosslink-spectrum-matches should be ignored if they don’t have target and decoy labels. By default an error is thrown if any unlabelled data is encountered.

Returns:

If a list of crosslink-spectrum-matches or crosslinks was provided, a list of validated crosslink-spectrum-matches or crosslinks is returned. If a parser_result was provided, a parser_result with validated crosslink-spectrum-matches and/or validated crosslinks will be returned.

Return type:

list of dict of str, any, or dict of str, any

Raises:
  • TypeError – If a wrong data type is provided.

  • TypeError – If parameter formula is not one of ‘D/T’, ‘(TD+DD)/TT’, or ‘(TD-DD)/TT’.

  • TypeError – If parameter score is not one of ‘higher_better’ or ‘lower_better’.

  • ValueError – If parameter fdr is outside of the supported range.

  • ValueError – If attribute ‘score’ is not available for any of the data.

  • ValueError – If attribute ‘alpha_decoy’ or ‘beta_decoy’ is not available for any of the data and parameter ignore_missing_labels is set to False.

  • ValueError – If the number of DD matches exceeds the number of TD matches for formula ‘(TD-DD)/TT’. FDR cannot be estimated with the formula ‘(TD-DD)/TT’ in these cases.

Notes

Please note that progress bars will usually not complete when running this function. This is by design as it is not necessary to iterate over all scores to estimate FDR.

Examples

>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import validate
>>> pr = read("data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_CSMs.xlsx", engine="MS Annika", crosslinker="DSS")
>>> csms = pr["crosslink-spectrum-matches"]
>>> len(csms)
826
>>> validated = validate(csms)
>>> len(validated)
705
>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import validate
>>> pr = read(["data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_CSMs.xlsx", "data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_Crosslinks.xlsx"], engine="MS Annika", crosslinker="DSS")
>>> len(pr["crosslink-spectrum-matches"])
826
>>> len(pr["crosslinks"])
300
>>> validated = validate(pr)
>>> len(validated["crosslink-spectrum-matches"])
705
>>> len(validated["crosslinks"])
226
>>> from pyXLMS.parser import read
>>> from pyXLMS.transform import validate
>>> pr = read(["data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_CSMs.xlsx", "data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1_Crosslinks.xlsx"], engine="MS Annika", crosslinker="DSS")
>>> len(pr["crosslink-spectrum-matches"])
826
>>> len(pr["crosslinks"])
300
>>> validated = validate(pr, fdr=0.05)
>>> len(validated["crosslink-spectrum-matches"])
825
>>> len(validated["crosslinks"])
260
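
To make the target-decoy estimate more concrete, here is a minimal stand-alone sketch of score-cutoff selection for scores where higher is better. It is an illustration under stated assumptions, not the pyXLMS implementation: the names label, estimate_fdr and validate_sketch are hypothetical, "D" in "D/T" is assumed to count every match with at least one decoy peptide, tied scores get no special handling, and decoy matches above the cutoff are kept in the result.

```python
from typing import Any, Dict, List


def label(match: Dict[str, Any]) -> str:
    """Classify a match as TT, TD or DD from its two decoy flags."""
    return ("TT", "TD", "DD")[int(match["alpha_decoy"]) + int(match["beta_decoy"])]


def estimate_fdr(tt: int, td: int, dd: int, formula: str = "D/T") -> float:
    """Estimate FDR from match counts using one of the three documented formulas."""
    if formula in ("D/T", "(TD+DD)/TT"):
        # Assumption: "D" counts every match that contains at least one decoy peptide.
        decoys = td + dd
    elif formula == "(TD-DD)/TT":
        if dd > td:
            raise ValueError("DD exceeds TD, FDR cannot be estimated with '(TD-DD)/TT'.")
        decoys = td - dd
    else:
        raise TypeError(f"Unsupported formula: {formula}")
    if tt == 0:
        return 0.0 if decoys == 0 else float("inf")
    return decoys / tt


def validate_sketch(data: List[Dict[str, Any]], fdr: float = 0.01, formula: str = "D/T") -> List[Dict[str, Any]]:
    """Keep the largest top-scoring subset whose estimated FDR is at or below the target."""
    ranked = sorted(data, key=lambda match: match["score"], reverse=True)
    counts = {"TT": 0, "TD": 0, "DD": 0}
    for match in ranked:
        counts[label(match)] += 1
    # Drop the worst-scoring match until the estimate meets the target, then stop early.
    while ranked and estimate_fdr(counts["TT"], counts["TD"], counts["DD"], formula) > fdr:
        counts[label(ranked.pop())] -= 1
    return ranked


matches = [
    {"score": 10.0, "alpha_decoy": False, "beta_decoy": False},
    {"score": 9.0, "alpha_decoy": False, "beta_decoy": False},
    {"score": 8.0, "alpha_decoy": False, "beta_decoy": False},
    {"score": 7.0, "alpha_decoy": True, "beta_decoy": False},
    {"score": 6.0, "alpha_decoy": False, "beta_decoy": False},
    {"score": 5.0, "alpha_decoy": True, "beta_decoy": True},
]
print([m["score"] for m in validate_sketch(matches, fdr=0.25)])  # [10.0, 9.0, 8.0, 7.0, 6.0]
print([m["score"] for m in validate_sketch(matches, fdr=0.01)])  # [10.0, 9.0, 8.0]
```

Because the loop stops as soon as the estimate meets the target, it usually does not visit every score, which is consistent with the note above about progress bars not completing.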

Module contents#