Core package¶

Core entities for hierarchy construction

class Contact(res1_seq, res2_seq, raw_score, distance_bound=(0, 8))[source]¶

Bases: conkit.core._Entity

A contact pair template to store all associated information

Examples

>>> from conkit.core import Contact
>>> contact = Contact(1, 25, 1.0)
>>> print(contact)
Contact(id="(1, 25)" res1="A" res1_seq=1 res2="A" res2_seq=25 raw_score=1.0)

Attributes

`distance_bound`	The lower and upper distance boundary values of a contact pair in Ångstrom [Default: 0-8Å].
`id`	The ID of the selected entity
`is_match`	A boolean status for the contact
`is_mismatch`	A boolean status for the contact
`is_unknown`	A boolean status for the contact
`lower_bound`	The lower distance boundary value
`raw_score`	The prediction score for the contact pair
`res1`	The amino acid of residue 1 [default: X]
`res2`	The amino acid of residue 2 [default: X]
`res1_chain`	The chain for residue 1
`res2_chain`	The chain for residue 2
`res1_seq`	The residue sequence number of residue 1
`res2_seq`	The residue sequence number of residue 2
`res1_altseq`	The alternative residue sequence number of residue 1
`res2_altseq`	The alternative residue sequence number of residue 2
`scalar_score`	The raw_score scaled according to the average `raw_score`
`status`	An indication of the contact status, i.e. true positive, false positive, or unknown
`upper_bound`	The upper distance boundary value
`weight`	A separate internal weight factor for the contact pair

Methods

`add`(entity)	Add a child to the `Entity`
`copy`()	Create a shallow copy of `Entity`
`deepcopy`()	Create a deep copy of `Entity`
`define_match`()	Define a contact as matching contact
`define_mismatch`()	Define a contact as mismatching contact
`define_unknown`()	Define a contact with unknown status
`remove`(id)	Remove a child

define_match()[source]¶: Define a contact as matching contact

define_mismatch()[source]¶: Define a contact as mismatching contact

define_unknown()[source]¶: Define a contact with unknown status

distance_bound¶: The lower and upper distance boundary values of a contact pair in Ångstrom [Default: 0-8Å].

is_match¶: A boolean status for the contact

is_mismatch¶: A boolean status for the contact

is_unknown¶: A boolean status for the contact

lower_bound¶: The lower distance boundary value

raw_score¶: The prediction score for the contact pair

res1¶: The amino acid of residue 1 [default: X]

res1_altseq¶: The alternative residue sequence number of residue 1

res1_chain¶: The chain for residue 1

res1_seq¶: The residue sequence number of residue 1

res2¶: The amino acid of residue 2 [default: X]

res2_altseq¶: The alternative residue sequence number of residue 2

res2_chain¶: The chain for residue 2

res2_seq¶: The residue sequence number of residue 2

scalar_score¶: The raw_score scaled according to the average raw_score

status¶: An indication of the contact status, i.e. true positive, false positive, or unknown

upper_bound¶: The upper distance boundary value

weight¶: A separate internal weight factor for the contact pair

class ContactFile(id)[source]¶

Bases: conkit.core._Entity

A contact file object representing a single prediction file

The ContactFile class represents a data structure that holds all predictions from a single contact map file. It contains functions to store, manipulate and organise contact maps.

Examples

>>> from conkit.core import ContactMap, ContactFile
>>> contact_file = ContactFile("example")
>>> contact_file.add(ContactMap("foo"))
>>> contact_file.add(ContactMap("bar"))
>>> print(contact_file)
ContactFile(id="example" nseqs=2)

Attributes

`author`	The author of the `ContactFile`
`method`	The `ContactFile`-specific method
`remark`	The `ContactFile`-specific remarks
`target`	The target name
`top_map`	The first `ContactMap` entry in `ContactFile`

Methods

`add`(entity)	Add a child to the `Entity`
`copy`()	Create a shallow copy of `Entity`
`deepcopy`()	Create a deep copy of `Entity`
`remove`(id)	Remove a child
`sort`(kword[, reverse, inplace])	Sort the `ContactFile`

author¶: The author of the ContactFile

method¶: The ContactFile-specific method

remark¶: The ContactFile-specific remarks

sort(kword, reverse=False, inplace=False)[source]¶

Sort the ContactFile

Parameters:

kword : str

The dictionary key to sort contacts by

reverse : bool, optional

Sort the contact pairs in descending order [default: False]

inplace : bool, optional

Replace the saved order of contacts [default: False]

Returns:

contact_file : ContactFile

The reference to the ContactFile, regardless of inplace

Raises:

ValueError

kword not in ContactFile

target¶: The target name

top_map¶

The first ContactMap entry in ContactFile

Returns:

top_map : ContactMap, None

The first ContactMap entry in ContactFile

class ContactMap(id)[source]¶

Bases: conkit.core._Entity

A contact map object representing a single prediction

The ContactMap class represents a data structure to hold a single contact map prediction in one place. It contains functions to store, manipulate and organise Contact instances.

Examples

>>> from conkit.core import Contact, ContactMap
>>> contact_map = ContactMap("example")
>>> contact_map.add(Contact(1, 10, 0.333))
>>> contact_map.add(Contact(5, 30, 0.667))
>>> print(contact_map)
ContactMap(id="example" ncontacts=2)

Attributes

`coverage`	The sequence coverage score
`id`	The ID of the selected entity
`ncontacts`	The number of `Contact` instances in the `ContactMap`
`precision`	The precision (Positive Predictive Value) score
`repr_sequence`	The representative `Sequence` associated with the `ContactMap`
`repr_sequence_altloc`	The representative altloc `Sequence` associated with the `ContactMap`
`sequence`	The `Sequence` associated with the `ContactMap`
`top_contact`	The first `Contact` entry in `ContactMap`

Methods

`add`(entity)	Add a child to the `Entity`
`assign_sequence_register`([altloc])	Assign the amino acids from `Sequence` to all `Contact` instances
`calculate_jaccard_index`(other)	Calculate the Jaccard index between two `ContactMap` instances
`calculate_kernel_density`([bw_method])	Calculate the contact density in the contact map using Gaussian kernels
`calculate_scalar_score`()	Calculate a scaled score for the `ContactMap`
`copy`()	Create a shallow copy of `Entity`
`deepcopy`()	Create a deep copy of `Entity`
`find`(indexes[, altloc])	Find all contacts associated with `index`
`match`(other[, remove_unmatched, renumber, ...])	Modify both hierarchies so residue numbers match one another.
`remove`(id)	Remove a child
`remove_neighbors`([min_distance, inplace])	Remove contacts between neighboring residues
`rescale`([inplace])	Rescale the raw scores in `ContactMap`
`sort`(kword[, reverse, inplace])	Sort the `ContactMap`

assign_sequence_register(altloc=False)[source]¶

Assign the amino acids from Sequence to all Contact instances

Parameters:

altloc : bool

Use the res_altloc positions [default: False]

calculate_jaccard_index(other)[source]¶

Calculate the Jaccard index between two ContactMap instances

This score analyzes the difference of the predicted contacts from two maps,

\[J_{x,y}=\frac{\left|x \cap y\right|}{\left|x \cup y\right|}\]

where \(x\) and \(y\) are the sets of predicted contacts from two different predictors, \(\left|x \cap y\right|\) is the number of elements in the intersection of \(x\) and \(y\), and \(\left|x \cup y\right|\) is the number of elements in the union of \(x\) and \(y\).

The J-score has values in the range of \([0, 1]\), with a value of \(1\) corresponding to identical contact maps and \(0\) to maps that share no contacts.
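As an illustration of the formula above, the index can be computed from two plain Python collections of residue-number pairs. This is a minimal sketch and not ConKit's implementation; the helper name `jaccard_index`, the order-independent treatment of pairs, and the value returned for two empty maps are assumptions made for the example.

```python
def jaccard_index(x, y):
    """Return the intersection size over the union size of two contact collections."""
    # Store each pair in sorted order so that (1, 10) and (10, 1) count as one contact
    x = {tuple(sorted(pair)) for pair in x}
    y = {tuple(sorted(pair)) for pair in y}
    union = x | y
    if not union:
        # Assumption: two empty maps are treated as identical
        return 1.0
    return len(x & y) / len(union)


map_a = [(1, 10), (5, 30), (7, 42)]
map_b = [(10, 1), (5, 30), (8, 50)]
index = jaccard_index(map_a, map_b)
print(index)      # 0.5, two shared contacts out of four distinct ones
print(1 - index)  # the corresponding Jaccard distance
```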

Parameters:

other : ContactMap

A ConKit ContactMap

Returns:

float

The Jaccard index

Warning

The Jaccard index ranges over \([0, 1]\), where \(1\) means the maps contain identical contact pairs.

See also

match, precision

Notes

The Jaccard index is different from the Jaccard distance mentioned in [1]. The Jaccard distance corresponds to \(1-Jaccard_{index}\).

[1]	Q. Wuyun, W. Zheng, Z. Peng, J. Yang (2016). A large-scale comparative assessment of methods for residue-residue contact prediction. Briefings in Bioinformatics, [doi: 10.1093/bib/bbw106].

calculate_kernel_density(bw_method='amise')[source]¶

Calculate the contact density in the contact map using Gaussian kernels

Various algorithms can be used to estimate the bandwidth. The algorithms listed below have been implemented to calculate the bandwidth for a 1D data array X with n data points and d dimensions. Please note, in Scott's and Silverman's rules, the value of \(\sigma\) is the smaller of the standard deviation of X and the normalized interquartile range.

Asymptotic Mean Integrated Squared Error (AMISE)

This particular choice of bandwidth recovers all the important features whilst maintaining smoothness. It is a direct implementation of the method used by [2].

Bowman & Azzalini [3] implementation

\[\sqrt{\frac{\sum{X}^2}{n}-(\frac{\sum{X}}{n})^2}*(\frac{(d+2)*n}{4})^\frac{-1}{d+4}\]

Scott’s [4] implementation

\[1.059*\sigma*n^\frac{-1}{d+4}\]

Silverman’s [5] implementation

\[0.9*\sigma*(n*\frac{d+2}{4})^\frac{-1}{d+4}\]
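The three closed-form rules above can be evaluated directly with the standard library. The sketch below is not ConKit's code: the function names are hypothetical, the population standard deviation is used throughout, and the interquartile range is normalized by 1.349 (its value for a unit normal distribution), which is an assumption about what "normalized" means here. The AMISE estimator is not covered.

```python
import math
import statistics


def sigma(data):
    """Smaller of the standard deviation and the normalized interquartile range."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    # Assumption: the IQR is normalized by 1.349
    return min(statistics.pstdev(data), (q3 - q1) / 1.349)


def bowman_bandwidth(data, d=1):
    n = len(data)
    # sum(X^2)/n - (sum(X)/n)^2 is the population variance of X
    variance = sum(v * v for v in data) / n - (sum(data) / n) ** 2
    return math.sqrt(max(variance, 0.0)) * ((d + 2) * n / 4) ** (-1 / (d + 4))


def scott_bandwidth(data, d=1):
    return 1.059 * sigma(data) * len(data) ** (-1 / (d + 4))


def silverman_bandwidth(data, d=1):
    n = len(data)
    return 0.9 * sigma(data) * (n * (d + 2) / 4) ** (-1 / (d + 4))


residues = [1, 2, 3, 4, 5, 6, 7, 8]
for rule in (bowman_bandwidth, scott_bandwidth, silverman_bandwidth):
    print(rule.__name__, round(rule(residues), 3))
```

For the same \(\sigma\), \(n\) and \(d=1\), Scott's rule always yields a slightly larger bandwidth than Silverman's, because the two differ only by a constant factor.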

[2]	Sadowski, M.I. (2013). Prediction of protein domain boundaries from inverse covariances.

[3]	Bowman, A.W. & Azzalini, A. (1997). Applied Smoothing Techniques for Data Analysis.

[4]	Scott, D.W. (1992). Multivariate Density Estimation: Theory, Practice, and Visualization.

[5]	Silverman, B.W. (1986). Density Estimation for Statistics and Data Analysis.

Parameters:

bw_method : str, optional

The bandwidth estimator to use [default: amise]

Returns:

list

The list of per-residue density estimates

Raises:

RuntimeError

Cannot find SciKit package

ValueError

Undefined bandwidth method

calculate_scalar_score()[source]¶

Calculate a scaled score for the ContactMap

This score is a scaled score for all raw scores in a contact map. It is defined by the formula

\[{x}'=\frac{x}{\overline{d}}\]

where \(x\) corresponds to the raw score of each predicted contact and \(\overline{d}\) to the mean of all raw scores.
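A minimal sketch of this scaling on a plain list of raw scores follows. The function name `scalar_scores` is hypothetical and the code is independent of ConKit; it only illustrates the formula.

```python
def scalar_scores(raw_scores):
    """Divide every raw score by the mean of all raw scores."""
    mean = sum(raw_scores) / len(raw_scores)
    return [score / mean for score in raw_scores]


raw = [1.0, 2.0, 3.0]
print(scalar_scores(raw))  # [0.5, 1.0, 1.5]
```

By construction the scaled scores have a mean of 1, so a value above 1 marks a contact that scores better than the average of its map.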

The score is saved in a separate Contact attribute called scalar_score

This score is described in more detail in [6].

[6]	S. Ovchinnikov, L. Kinch, H. Park, Y. Liao, J. Pei, D.E. Kim, H. Kamisetty, N.V. Grishin, D. Baker (2015). Large-scale determination of previously unsolved protein structures using evolutionary information. Elife 4, e09248.

coverage¶

The sequence coverage score

The coverage score is calculated by analysing the number of residues covered by the predicted contact pairs.

\[Coverage=\frac{x_{cov}}{L}\]

Here \(x_{cov}\) is the number of residues covered by at least one predicted contact and \(L\) is the number of residues in the sequence.
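The calculation can be sketched with plain tuples, assuming \(x_{cov}\) counts the distinct residues that take part in at least one contact, as the opening sentence of this entry describes. The function name and the tuple representation are hypothetical and not part of ConKit.

```python
def coverage(contacts, sequence_length):
    """Fraction of residues that take part in at least one contact."""
    covered = set()
    for res1_seq, res2_seq in contacts:
        covered.update((res1_seq, res2_seq))
    return len(covered) / sequence_length


contacts = [(1, 10), (1, 25), (10, 40)]
print(coverage(contacts, 50))  # 4 covered residues / 50 residues = 0.08
```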

Returns:

cov : float

The calculated coverage score

See also

precision

find(indexes, altloc=False)[source]¶

Find all contacts associated with index

Parameters:

indexes : list, tuple

A list of residue indexes to find

altloc : bool

Use the res_altloc positions [default: False]

Returns:

ContactMap

A modified version of the contact map containing the found contacts
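The selection can be pictured as a filter over residue-number pairs. The sketch below assumes that a contact is kept when either of its residues appears in the requested indexes; that rule, the function name and the tuple representation are assumptions for illustration, not ConKit's documented behaviour.

```python
def find_contacts(contacts, indexes):
    """Keep the contacts in which either residue number is listed in indexes."""
    wanted = set(indexes)
    return [c for c in contacts if c[0] in wanted or c[1] in wanted]


contacts = [(1, 10), (5, 30), (10, 42), (7, 50)]
print(find_contacts(contacts, [10, 7]))  # [(1, 10), (10, 42), (7, 50)]
```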

match(other, remove_unmatched=False, renumber=False, inplace=False)[source]¶

Modify both hierarchies so residue numbers match one another.

This function is key when plotting contact maps or visualising them in 3-dimensional space, particularly when residue numbers in the structure do not start at count 0 or when peptide chain breaks are present.

Parameters:

other : ContactMap

A ConKit ContactMap

remove_unmatched : bool, optional

Remove all unmatched contacts [default: False]

renumber : bool, optional

Renumber the res_seq entries [default: False]

If True, res1_seq and res2_seq change but id remains the same

inplace : bool, optional

Replace the saved order of contacts [default: False]

Returns:

hierarchy_mod

ContactMap instance, regardless of inplace

Raises:

ValueError

Error creating reliable keymap matching the sequence in ContactMap

ncontacts¶

The number of Contact instances in the ContactMap

Returns:

ncontacts : int

The number of contacts in the ContactMap

precision¶

The precision (Positive Predictive Value) score

The precision value is calculated by analysing the true and false positive contacts.

\[Precision=\frac{TruePositives}{TruePositives + FalsePositives}\]

The status of each contact, i.e. true or false positive, can be determined by running the match() function with a reference structure.
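As a sketch, the score follows from per-contact statuses using the standard definition of the positive predictive value, TP / (TP + FP). The string labels `"match"`, `"mismatch"` and `"unknown"` are hypothetical stand-ins for the `is_match`, `is_mismatch` and `is_unknown` flags, and returning 0.0 when no contact has been evaluated is an assumption.

```python
def precision(statuses):
    """Positive predictive value from a list of per-contact statuses."""
    true_positives = statuses.count("match")
    false_positives = statuses.count("mismatch")
    evaluated = true_positives + false_positives
    if evaluated == 0:
        # Assumption: without evaluated contacts the score is reported as 0.0
        return 0.0
    return true_positives / evaluated


statuses = ["match", "match", "mismatch", "unknown", "match"]
print(precision(statuses))  # 3 / (3 + 1) = 0.75
```

Contacts with unknown status do not enter the ratio in this sketch.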

Returns:

ppv : float

The calculated precision score

See also

coverage

remove_neighbors(min_distance=5, inplace=False)[source]¶

Remove contacts between neighboring residues

Parameters:

min_distance : int, optional

The minimum number of residues between contacts [default: 5]

inplace : bool, optional

Replace the saved order of contacts [default: False]

Returns:

contact_map : ContactMap

The reference to the ContactMap, regardless of inplace
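The filter can be sketched on residue-number pairs. It is assumed here that a contact is kept when its two residues are at least `min_distance` positions apart in sequence; whether ConKit treats the boundary as inclusive is not stated above, and the function name is hypothetical.

```python
def drop_neighbors(contacts, min_distance=5):
    """Keep contacts whose residues are at least min_distance apart in sequence."""
    return [(i, j) for i, j in contacts if abs(i - j) >= min_distance]


contacts = [(1, 3), (2, 7), (10, 14), (10, 40)]
print(drop_neighbors(contacts))  # [(2, 7), (10, 40)]
```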

repr_sequence¶

The representative Sequence associated with the ContactMap

The peptide sequence constructed from the available contacts using the normal res_seq positions

Returns:

sequence : conkit.core.Sequence

Raises:

TypeError

Sequence undefined

See also

repr_sequence_altloc, sequence

repr_sequence_altloc¶

The representative altloc Sequence associated with the ContactMap

The peptide sequence constructed from the available contacts using the altloc res_seq positions

Returns:

sequence : Sequence

Raises:

ValueError

Sequence undefined

See also

repr_sequence, sequence

rescale(inplace=False)[source]¶

Rescale the raw scores in ContactMap

Rescaling of the data is done to normalize the raw scores to be in the range [0, 1]. The formula to rescale the data is:

\[{x}'=\frac{x-min(d)}{max(d)-min(d)}\]

\(x\) is the original value and \(d\) are all values to be rescaled.
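A minimal min-max rescaling of a plain list of raw scores looks as follows. The function name is hypothetical, and returning zeros when all scores are equal is an assumption added to avoid a division by zero; the text above does not say how that case is handled.

```python
def rescale_scores(raw_scores):
    """Map raw scores linearly onto the range [0, 1]."""
    low, high = min(raw_scores), max(raw_scores)
    if high == low:
        # Assumption: identical scores cannot be spread over [0, 1]
        return [0.0 for _ in raw_scores]
    return [(x - low) / (high - low) for x in raw_scores]


print(rescale_scores([2.0, 4.0, 10.0]))  # [0.0, 0.25, 1.0]
```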

Parameters:

inplace : bool, optional

Replace the saved order of contacts [default: False]

Returns:

contact_map : ContactMap

The reference to the ContactMap, regardless of inplace

sequence¶

The Sequence associated with the ContactMap

Returns:	`Sequence`

See also

repr_sequence, repr_sequence_altloc

sort(kword, reverse=False, inplace=False)[source]¶

Sort the ContactMap

Parameters:

kword : str

The dictionary key to sort contacts by

reverse : bool, optional

Sort the contact pairs in descending order [default: False]

inplace : bool, optional

Replace the saved order of contacts [default: False]

Returns:

contact_map : ContactMap

The reference to the ContactMap, regardless of inplace

Raises:

ValueError

kword not in ContactMap
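The interplay of kword, reverse and inplace can be sketched with simple stand-in objects. Nothing below uses ConKit itself: `sort_contacts` and the `SimpleNamespace` contacts are hypothetical, and the sketch only mirrors the behaviour described above (sort by a named attribute, raise ValueError for an unknown key, leave the original order untouched unless inplace is set).

```python
import copy
from types import SimpleNamespace


def sort_contacts(contacts, kword, reverse=False, inplace=False):
    """Sort a list of contact-like objects by the attribute named kword."""
    if not all(hasattr(c, kword) for c in contacts):
        raise ValueError("kword %r not in contacts" % kword)
    # Reorder the original list only when inplace is requested
    target = contacts if inplace else copy.copy(contacts)
    target.sort(key=lambda c: getattr(c, kword), reverse=reverse)
    return target


contacts = [
    SimpleNamespace(res1_seq=1, res2_seq=10, raw_score=0.333),
    SimpleNamespace(res1_seq=5, res2_seq=30, raw_score=0.667),
]
ranked = sort_contacts(contacts, "raw_score", reverse=True)
print([c.raw_score for c in ranked])    # [0.667, 0.333]
print([c.raw_score for c in contacts])  # [0.333, 0.667], original order kept
```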

top_contact¶

The first Contact entry in ContactMap

Returns:

top_contact : Contact, None

The first Contact entry in ContactMap

class Sequence(id, seq)[source]¶

Bases: conkit.core._Entity

A sequence template to store all associated information

Examples

>>> from conkit.core import Sequence
>>> sequence_entry = Sequence("example", "ABCDEF")
>>> print(sequence_entry)
Sequence(id="example" seq="ABCDEF" seqlen=6)

Attributes

`id`	The ID of the selected entity
`remark`	The `Sequence`-specific remarks
`seq`	The protein sequence as `str`
`seq_len`	The protein sequence length

Methods

`add`(entity)	Add a child to the `Entity`
`align_global`(other[, id_chars, nonid_chars, ...])	Generate a global alignment between two `Sequence` instances
`align_local`(other[, id_chars, nonid_chars, ...])	Generate a local alignment between two `Sequence` instances
`copy`()	Create a shallow copy of `Entity`
`deepcopy`()	Create a deep copy of `Entity`
`remove`(id)	Remove a child

align_global(other, id_chars=2, nonid_chars=1, gap_open_pen=-0.5, gap_ext_pen=-0.1, inplace=False)[source]¶

Generate a global alignment between two Sequence instances

Parameters:

other : Sequence

id_chars : int, optional

nonid_chars : int, optional

gap_open_pen : float, optional

gap_ext_pen : float, optional

inplace : bool, optional

Replace the saved order of residues [default: False]

Returns:

Sequence

The reference to the Sequence, regardless of inplace

Sequence

The reference to the Sequence, regardless of inplace

align_local(other, id_chars=2, nonid_chars=1, gap_open_pen=-0.5, gap_ext_pen=-0.1, inplace=False)[source]¶

Generate a local alignment between two Sequence instances

Parameters:

other : Sequence

id_chars : int, optional

nonid_chars : int, optional

gap_open_pen : float, optional

gap_ext_pen : float, optional

inplace : bool, optional

Replace the saved order of residues [default: False]

Returns:

Sequence

The reference to the Sequence, regardless of inplace

Sequence

The reference to the Sequence, regardless of inplace

remark¶: The Sequence-specific remarks

seq¶: The protein sequence as str

seq_len¶: The protein sequence length

class SequenceFile(id)[source]¶

Bases: conkit.core._Entity

A sequence file object representing a single sequence file

The SequenceFile class represents a data structure to hold Sequence instances in a single sequence file. It contains functions to store and analyze sequences.

Examples

>>> from conkit.core import Sequence, SequenceFile
>>> sequence_file = SequenceFile("example")
>>> sequence_file.add(Sequence("foo", "ABCDEF"))
>>> sequence_file.add(Sequence("bar", "ZYXWVU"))
>>> print(sequence_file)
SequenceFile(id="example" nseqs=2)

Attributes

`id`	The ID of the selected entity
`is_alignment`	A boolean status for the alignment
`nseqs`	The number of `Sequence` instances
`remark`	The `SequenceFile`-specific remarks
`status`	An indication of the residue status, i.e. true positive, false positive, or unknown
`top_sequence`	The first `Sequence` entry in `SequenceFile`

Methods

`add`(entity)	Add a child to the `Entity`
`calculate_freq`()	Calculate the gap frequency in each alignment column
`calculate_meff`([identity])	Calculate the number of effective sequences
`copy`()	Create a shallow copy of `Entity`
`deepcopy`()	Create a deep copy of `Entity`
`remove`(id)	Remove a child
`sort`(kword[, reverse, inplace])	Sort the `SequenceFile`
`trim`(start, end[, inplace])	Trim the `SequenceFile`

calculate_freq()[source]¶

Calculate the gap frequency in each alignment column

This function calculates the frequency of gaps at each position in the Multiple Sequence Alignment.

Returns:

list

A list containing the per alignment-column amino acid frequency count

Raises:

MemoryError

Too many sequences in the alignment

RuntimeError

SequenceFile is not an alignment

calculate_meff(identity=0.7)[source]¶

Calculate the number of effective sequences

This function calculates the number of effective sequences (Meff) in the Multiple Sequence Alignment.

The mathematical function used to calculate Meff is

\[M_{eff}=\sum_{i}\frac{1}{\sum_{j}S_{i,j}}\]
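The formula can be sketched in pure Python. The definition of \(S_{i,j}\) is not given above; the sketch assumes \(S_{i,j}=1\) when the fraction of identical positions between sequences \(i\) and \(j\) is at least `identity`, and \(0\) otherwise. The function name is hypothetical, the result is returned as a float, and the quadratic loop is only meant for small alignments.

```python
def effective_sequences(sequences, identity=0.7):
    """Sum over sequences of 1 / (number of sequences similar to it)."""
    if not 0 <= identity <= 1:
        raise ValueError("Sequence identity needs to be between 0 and 1")
    if len({len(s) for s in sequences}) > 1:
        raise ValueError("Sequences are not aligned")

    def similar(a, b):
        # Assumption: S_ij = 1 if the fraction of identical columns >= identity
        matches = sum(1 for x, y in zip(a, b) if x == y)
        return matches / len(a) >= identity

    meff = 0.0
    for seq_i in sequences:
        # sum_j S_ij counts the sequences similar to sequence i, itself included
        neighbours = sum(1 for seq_j in sequences if similar(seq_i, seq_j))
        meff += 1.0 / neighbours
    return meff


alignment = ["ABCDEFGHIJ", "ABCDEFGHIK", "ZYXWVUTSRQ"]
print(effective_sequences(alignment))  # 2.0
```

The first two sequences are 90% identical and therefore share one unit of weight, while the third contributes a full unit on its own.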

Parameters:

identity : float, optional

The sequence identity to use for similarity decision [default: 0.7]

Returns:

int

The number of effective sequences

Raises:

MemoryError

Too many sequences in the alignment for Hamming distance calculation

RuntimeError

SciPy package not installed

ValueError

SequenceFile is not an alignment

ValueError

Sequence Identity needs to be between 0 and 1

is_alignment¶

A boolean status for the alignment

Returns:

bool

A boolean status for the alignment

nseqs¶

The number of Sequence instances in the SequenceFile

Returns:

int

The number of sequences in the SequenceFile

remark¶: The SequenceFile-specific remarks

sort(kword, reverse=False, inplace=False)[source]¶

Sort the SequenceFile

Parameters:

kword : str

The dictionary key to sort sequences by

reverse : bool, optional

Sort the sequences in reverse order [default: False]

inplace : bool, optional

Replace the saved order of sequences [default: False]

Returns:

SequenceFile

The reference to the SequenceFile, regardless of inplace

Raises:

ValueError

kword not in SequenceFile

status¶: An indication of the residue status, i.e. true positive, false positive, or unknown

top_sequence¶

The first Sequence entry in SequenceFile

Returns:

Sequence, None

The first Sequence entry in SequenceFile

trim(start, end, inplace=False)[source]¶

Trim the SequenceFile

Parameters:

start : int

First residue to include

end : int

Final residue to include

inplace : bool, optional

Replace the saved order of sequences [default: False]

Returns:

SequenceFile

The reference to the SequenceFile, regardless of inplace
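Trimming can be pictured as slicing every sequence to the same window. The sketch assumes that start and end are 1-based and inclusive, which matches "first residue to include" and "final residue to include" but is not stated explicitly; the function name and the use of plain strings are hypothetical.

```python
def trim_sequences(sequences, start, end):
    """Keep residues start..end (1-based, inclusive) of every sequence."""
    if start < 1 or end < start:
        raise ValueError("start and end must satisfy 1 <= start <= end")
    return [seq[start - 1:end] for seq in sequences]


print(trim_sequences(["ABCDEF", "ZYXWVU"], 2, 4))  # ['BCD', 'YXW']
```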