Core package

Core entities for hierarchy construction

class Contact(res1_seq, res2_seq, raw_score, distance_bound=(0, 8))[source]

Bases: conkit.core._Entity

A contact pair template to store all associated information

Examples

>>> from conkit.core import Contact
>>> contact = Contact(1, 25, 1.0)
>>> print(contact)
Contact(id="(1, 25)" res1="A" res1_seq=1 res2="A" res2_seq=25 raw_score=1.0)

Attributes

distance_bound The lower and upper distance boundary values of a contact pair in Ångstrom [Default: 0-8Å].
id The ID of the selected entity
is_match A boolean status for the contact
is_mismatch A boolean status for the contact
is_unknown A boolean status for the contact
lower_bound The lower distance boundary value
raw_score The prediction score for the contact pair
res1 The amino acid of residue 1 [default: X]
res2 The amino acid of residue 2 [default: X]
res1_chain The chain for residue 1
res2_chain The chain for residue 2
res1_seq The residue sequence number of residue 1
res2_seq The residue sequence number of residue 2
res1_altseq The alternative residue sequence number of residue 1
res2_altseq The alternative residue sequence number of residue 2
scalar_score The raw_score scaled according to the average raw_score
status An indication of the residue status, i.e. true positive, false positive, or unknown
upper_bound The upper distance boundary value
weight A separate internal weight factor for the contact pair

Methods

add(entity) Add a child to the Entity
copy() Create a shallow copy of Entity
deepcopy() Create a deep copy of Entity
define_match() Define a contact as matching contact
define_mismatch() Define a contact as mismatching contact
define_unknown() Define a contact with unknown status
remove(id) Remove a child

define_match()[source]

Define a contact as matching contact

define_mismatch()[source]

Define a contact as mismatching contact

define_unknown()[source]

Define a contact with unknown status

distance_bound

The lower and upper distance boundary values of a contact pair in Ångstrom [Default: 0-8Å].

is_match

A boolean status for the contact

is_mismatch

A boolean status for the contact

is_unknown

A boolean status for the contact

lower_bound

The lower distance boundary value

raw_score

The prediction score for the contact pair

res1

The amino acid of residue 1 [default: X]

res1_altseq

The alternative residue sequence number of residue 1

res1_chain

The chain for residue 1

res1_seq

The residue sequence number of residue 1

res2

The amino acid of residue 2 [default: X]

res2_altseq

The alternative residue sequence number of residue 2

res2_chain

The chain for residue 2

res2_seq

The residue sequence number of residue 2

scalar_score

The raw_score scaled according to the average raw_score

status

An indication of the residue status, i.e. true positive, false positive, or unknown

upper_bound

The upper distance boundary value

weight

A separate internal weight factor for the contact pair

class ContactFile(id)[source]

Bases: conkit.core._Entity

A contact file object representing a single prediction file

The contact file class represents a data structure to hold all predictions with a single contact map file. It contains functions to store, manipulate and organise contact maps.

Examples

>>> from conkit.core import ContactMap, ContactFile
>>> contact_file = ContactFile("example")
>>> contact_file.add(ContactMap("foo"))
>>> contact_file.add(ContactMap("bar"))
>>> print(contact_file)
ContactFile(id="example" nseqs=2)

Attributes

author The author of the ContactFile
method The ContactFile-specific method
remark The ContactFile-specific remarks
target The target name
top_map The first ContactMap entry in ContactFile

Methods

add(entity) Add a child to the Entity
copy() Create a shallow copy of Entity
deepcopy() Create a deep copy of Entity
remove(id) Remove a child
sort(kword[, reverse, inplace]) Sort the ContactFile

author

The author of the ContactFile

method

The ContactFile-specific method

remark

The ContactFile-specific remarks

sort(kword, reverse=False, inplace=False)[source]

Sort the ContactFile

Parameters:

kword : str

The dictionary key to sort contacts by

reverse : bool, optional

Sort the contact pairs in descending order [default: False]

inplace : bool, optional

Replace the saved order of contacts [default: False]

Returns:

contact_file : ContactFile

The reference to the ContactFile, regardless of inplace

Raises:

ValueError

kword not in ContactFile

target

The target name

top_map

The first ContactMap entry in ContactFile

Returns:

top_map : ContactMap, None

The first ContactMap entry in ContactFile

class ContactMap(id)[source]

Bases: conkit.core._Entity

A contact map object representing a single prediction

The ContactMap class represents a data structure to hold a single contact map prediction in one place. It contains functions to store, manipulate and organise Contact instances.

Examples

>>> from conkit.core import Contact, ContactMap
>>> contact_map = ContactMap("example")
>>> contact_map.add(Contact(1, 10, 0.333))
>>> contact_map.add(Contact(5, 30, 0.667))
>>> print(contact_map)
ContactMap(id="example" ncontacts=2)

Attributes

coverage The sequence coverage score
id The ID of the selected entity
ncontacts The number of Contact instances in the ContactMap
precision The precision (Positive Predictive Value) score
repr_sequence The representative Sequence associated with the ContactMap
repr_sequence_altloc The representative altloc Sequence associated with the ContactMap
sequence The Sequence associated with the ContactMap
top_contact The first Contact entry in ContactMap

Methods

add(entity) Add a child to the Entity
assign_sequence_register([altloc]) Assign the amino acids from Sequence to all Contact instances
calculate_jaccard_index(other) Calculate the Jaccard index between two ContactMap instances
calculate_kernel_density([bw_method]) Calculate the contact density in the contact map using Gaussian kernels
calculate_scalar_score() Calculate a scaled score for the ContactMap
copy() Create a shallow copy of Entity
deepcopy() Create a deep copy of Entity
find(indexes[, altloc]) Find all contacts associated with index
match(other[, remove_unmatched, renumber, ...]) Modify both hierarchies so residue numbers match one another.
remove(id) Remove a child
remove_neighbors([min_distance, inplace]) Remove contacts between neighboring residues
rescale([inplace]) Rescale the raw scores in ContactMap
sort(kword[, reverse, inplace]) Sort the ContactMap

assign_sequence_register(altloc=False)[source]

Assign the amino acids from Sequence to all Contact instances

Parameters:

altloc : bool

Use the res_altloc positions [default: False]

calculate_jaccard_index(other)[source]

Calculate the Jaccard index between two ContactMap instances

This score analyzes the difference of the predicted contacts from two maps,

\[J_{x,y}=\frac{\left|x \cap y\right|}{\left|x \cup y\right|}\]

where \(x\) and \(y\) are the sets of predicted contacts from two different predictors, \(\left|x \cap y\right|\) is the number of elements in the intersection of \(x\) and \(y\), and the \(\left|x \cup y\right|\) represents the number of elements in the union of \(x\) and \(y\).

The J-score has values in the range of \([0, 1]\), with a value of \(1\) corresponding to identical contact maps and \(0\) to dissimilar ones.
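As an illustration of the formula above, the following sketch computes the index for two plain Python sets of residue pairs. It does not use ConKit and is not its implementation; the function name `jaccard_index` and the convention of returning 1.0 for two empty sets are assumptions made for this example.

```python
def jaccard_index(x, y):
    """Jaccard index of two collections of (res1_seq, res2_seq) pairs."""
    x, y = set(x), set(y)
    union = x | y
    if not union:
        # Assumption: two empty maps are treated as identical.
        return 1.0
    return len(x & y) / len(union)


map_a = {(1, 10), (5, 30), (7, 42)}
map_b = {(1, 10), (5, 30), (9, 50)}

# Two shared pairs out of four distinct pairs.
index = jaccard_index(map_a, map_b)
# The Jaccard distance is the complement of the index.
distance = 1.0 - index
```

Using sets makes the intersection and union counts direct translations of \(\left|x \cap y\right|\) and \(\left|x \cup y\right|\).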

Parameters:

other : ContactMap

A ConKit ContactMap

Returns:

float

The Jaccard index

Warning

The Jaccard index lies in the range \([0, 1]\), where \(1\) means the maps contain identical contact pairs.

See also

match, precision

Notes

The Jaccard index is different from the Jaccard distance mentioned in [1]. The Jaccard distance corresponds to \(1-Jaccard_{index}\).

[1] Q. Wuyun, W. Zheng, Z. Peng, J. Yang (2016). A large-scale comparative assessment of methods for residue-residue contact prediction. Briefings in Bioinformatics, [doi: 10.1093/bib/bbw106].

calculate_kernel_density(bw_method='amise')[source]

Calculate the contact density in the contact map using Gaussian kernels

Various algorithms can be used to estimate the bandwidth. To calculate the bandwidth for a 1D data array X with n data points and d dimensions, the listed algorithms have been implemented. Please note, in rules 3 and 4, the value of \(\sigma\) is the smaller of the standard deviation of X or the normalized interquartile range.

  1. Asymptotic Mean Integrated Squared Error (AMISE)

    This particular choice of bandwidth recovers all the important features whilst maintaining smoothness. It is a direct implementation of the method used by [2].

  2. Bowman & Azzalini [3] implementation

\[\sqrt{\frac{\sum{X}^2}{n}-(\frac{\sum{X}}{n})^2}*(\frac{(d+2)*n}{4})^\frac{-1}{d+4}\]

  3. Scott’s [4] implementation

\[1.059*\sigma*n^\frac{-1}{d+4}\]

  4. Silverman’s [5] implementation

\[0.9*\sigma*(n*\frac{d+2}{4})^\frac{-1}{d+4}\]

[2] Sadowski, M.I. (2013). Prediction of protein domain boundaries from inverse covariances.
[3] Bowman, A.W. & Azzalini, A. (1997). Applied Smoothing Techniques for Data Analysis.
[4] Scott, D.W. (1992). Multivariate Density Estimation: Theory, Practice, and Visualization.
[5] Silverman, B.W. (1986). Density Estimation for Statistics and Data Analysis.
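The closed-form bandwidth rules listed above (Bowman & Azzalini, Scott, Silverman) can be sketched with the standard library alone. This is a reading of the formulas, not ConKit's code. The function names are hypothetical, and three choices are assumptions: the population standard deviation is used, quartiles come from `statistics.quantiles` with its default method, and the interquartile range is normalized by dividing by 1.349. The AMISE estimator is not covered.

```python
import statistics


def robust_sigma(x):
    """Smaller of the standard deviation and the normalized interquartile range."""
    std = statistics.pstdev(x)  # assumption: population standard deviation
    q1, _, q3 = statistics.quantiles(x, n=4)
    return min(std, (q3 - q1) / 1.349)  # assumption: IQR normalized by 1.349


def bowman_azzalini_bandwidth(x, d=1):
    # sqrt(sum(X^2)/n - (sum(X)/n)^2) is the population standard deviation.
    n = len(x)
    return statistics.pstdev(x) * ((d + 2) * n / 4.0) ** (-1.0 / (d + 4))


def scott_bandwidth(x, d=1):
    n = len(x)
    return 1.059 * robust_sigma(x) * n ** (-1.0 / (d + 4))


def silverman_bandwidth(x, d=1):
    n = len(x)
    return 0.9 * robust_sigma(x) * (n * (d + 2) / 4.0) ** (-1.0 / (d + 4))


data = [float(v) for v in range(1, 33)]  # 32 toy data points
bandwidths = {
    "bowman": bowman_azzalini_bandwidth(data),
    "scott": scott_bandwidth(data),
    "silverman": silverman_bandwidth(data),
}
```

With d = 1 and identical \(\sigma\), Silverman's rule always gives a narrower bandwidth than Scott's, because both its prefactor and its sample-size term are smaller.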
Parameters:

bw_method : str, optional

The bandwidth estimator to use [default: amise]

Returns:

list

The list of per-residue density estimates

Raises:

RuntimeError

Cannot find SciKit package

ValueError

Undefined bandwidth method

calculate_scalar_score()[source]

Calculate a scaled score for the ContactMap

This score is a scaled score for all raw scores in a contact map. It is defined by the formula

\[{x}'=\frac{x}{\overline{d}}\]

where \(x\) corresponds to the raw score of each predicted contact and \(\overline{d}\) to the mean of all raw scores.

The score is saved in a separate Contact attribute called scalar_score
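A minimal sketch of the scaling formula on a plain list of raw scores, independent of ConKit. The function name `scalar_scores` is hypothetical, and the sketch assumes the mean of the raw scores is non-zero.

```python
def scalar_scores(raw_scores):
    """Divide every raw score by the mean of all raw scores."""
    mean = sum(raw_scores) / len(raw_scores)
    # Assumption: mean != 0; otherwise the division below fails.
    return [x / mean for x in raw_scores]


scaled = scalar_scores([1.0, 2.0, 3.0])  # the mean is 2.0
```

A scaled score above 1 marks a contact that scores better than the average contact in the map.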

This score is described in more detail in [6].

[6] S. Ovchinnikov, L. Kinch, H. Park, Y. Liao, J. Pei, D.E. Kim, H. Kamisetty, N.V. Grishin, D. Baker (2015). Large-scale determination of previously unsolved protein structures using evolutionary information. Elife 4, e09248.

coverage

The sequence coverage score

The coverage score is calculated by analysing the number of residues covered by the predicted contact pairs.

\[Coverage=\frac{x_{cov}}{L}\]

Here \(x_{cov}\) is the number of residues covered by at least one predicted contact pair and \(L\) is the number of residues in the sequence.
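The sketch below illustrates the coverage formula on plain tuples, reading \(x_{cov}\) as the number of distinct residues that take part in at least one contact, as the description of the score suggests. The function name and the tuple representation are assumptions for illustration, not ConKit's implementation.

```python
def coverage(contacts, sequence_length):
    """Fraction of residues that appear in at least one contact pair."""
    covered = set()
    for res1_seq, res2_seq in contacts:
        covered.update((res1_seq, res2_seq))
    return len(covered) / sequence_length


# Residues 1, 5, 10 and 30 are covered in a 40-residue sequence.
cov = coverage([(1, 10), (5, 30), (1, 30)], sequence_length=40)
```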

Returns:

cov : float

The calculated coverage score

See also

precision

find(indexes, altloc=False)[source]

Find all contacts associated with index

Parameters:

indexes : list, tuple

A list of residue indexes to find

altloc : bool

Use the res_altloc positions [default: False]

Returns:

ContactMap

A modified version of the contact map containing the found contacts
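One plausible reading of this lookup, sketched on plain tuples: a contact is kept when either of its residues is in the requested indexes. Whether ConKit requires one or both residues to match is not stated here, so the "either residue" rule and the name `find_contacts` are assumptions.

```python
def find_contacts(contacts, indexes):
    """Keep contacts in which at least one residue number is requested."""
    wanted = set(indexes)
    return [c for c in contacts if c[0] in wanted or c[1] in wanted]


contacts = [(1, 10), (5, 30), (10, 42)]
found = find_contacts(contacts, [10])
```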

match(other, remove_unmatched=False, renumber=False, inplace=False)[source]

Modify both hierarchies so residue numbers match one another.

This function is key when plotting contact maps or visualising them in 3-dimensional space, in particular when residue numbers in the structure do not start at count 0 or when peptide chain breaks are present.

Parameters:

other : ContactMap

A ConKit ContactMap

remove_unmatched : bool, optional

Remove all unmatched contacts [default: False]

renumber : bool, optional

Renumber the res_seq entries [default: False]

If True, res1_seq and res2_seq changes but id remains the same

inplace : bool, optional

Replace the saved order of contacts [default: False]

Returns:

hierarchy_mod

ContactMap instance, regardless of inplace

Raises:

ValueError

Error creating reliable keymap matching the sequence in ContactMap

ncontacts

The number of Contact instances in the ContactMap

Returns:

ncontacts : int

The number of contacts in the ContactMap

precision

The precision (Positive Predictive Value) score

The precision value is calculated by analysing the true and false positive contacts.

\[Precision=\frac{TruePositives}{TruePositives + FalsePositives}\]

The status of each contact, i.e. true or false positive status, can be determined by running the match() function with a reference structure.
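A small sketch of the positive predictive value in its standard form, TP / (TP + FP), computed from a list of status labels. The labels `"TP"`, `"FP"` and `"UNK"`, the function name, and the choice to return 0.0 when no contact has a known status are assumptions for illustration only.

```python
def precision(statuses):
    """Positive predictive value from per-contact status labels."""
    tp = statuses.count("TP")
    fp = statuses.count("FP")
    if tp + fp == 0:
        # Assumption: no matched contacts gives a precision of 0.0.
        return 0.0
    return tp / (tp + fp)


# Contacts with unknown status do not enter the calculation.
ppv = precision(["TP", "TP", "TP", "FP", "UNK"])
```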

Returns:

ppv : float

The calculated precision score

See also

coverage

remove_neighbors(min_distance=5, inplace=False)[source]

Remove contacts between neighboring residues

Parameters:

min_distance : int, optional

The minimum number of residues between contacts [default: 5]

inplace : bool, optional

Replace the saved order of contacts [default: False]

Returns:

contact_map : ContactMap

The reference to the ContactMap, regardless of inplace
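The filter can be pictured as a sequence-separation cutoff. The sketch keeps a contact when its two residue numbers are at least `min_distance` apart; whether ConKit's boundary is inclusive or exclusive is not stated here, so the `>=` comparison and the function name are assumptions.

```python
def drop_neighbors(contacts, min_distance=5):
    """Keep contacts whose residues are at least min_distance apart in sequence."""
    return [(i, j) for i, j in contacts if abs(i - j) >= min_distance]


kept = drop_neighbors([(1, 3), (1, 6), (10, 12), (5, 30)])
```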

repr_sequence

The representative Sequence associated with the ContactMap

The peptide sequence constructed from the available contacts using the normal res_seq positions

Returns:

sequence : conkit.core.Sequence

Raises:

TypeError

Sequence undefined

repr_sequence_altloc

The representative altloc Sequence associated with the ContactMap

The peptide sequence constructed from the available contacts using the altloc res_seq positions

Returns:

sequence : Sequence

Raises:

ValueError

Sequence undefined

rescale(inplace=False)[source]

Rescale the raw scores in ContactMap

Rescaling of the data is done to normalize the raw scores to be in the range [0, 1]. The formula to rescale the data is:

\[{x}'=\frac{x-min(d)}{max(d)-min(d)}\]

\(x\) is the original value and \(d\) are all values to be rescaled.
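The min-max formula above, sketched on a plain list of scores rather than on a ContactMap. The function name is hypothetical, and raising `ValueError` when all scores are equal is an assumption of this sketch, since the formula is undefined in that case.

```python
def rescale_scores(raw_scores):
    """Min-max normalize scores into the range [0, 1]."""
    lo, hi = min(raw_scores), max(raw_scores)
    if hi == lo:
        # Assumption: identical scores cannot be rescaled (division by zero).
        raise ValueError("All scores are identical")
    return [(x - lo) / (hi - lo) for x in raw_scores]


rescaled = rescale_scores([2.0, 4.0, 6.0])
```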

Parameters:

inplace : bool, optional

Replace the saved order of contacts [default: False]

Returns:

contact_map : ContactMap

The reference to the ContactMap, regardless of inplace

sequence

The Sequence associated with the ContactMap

Returns:

Sequence

sort(kword, reverse=False, inplace=False)[source]

Sort the ContactMap

Parameters:

kword : str

The dictionary key to sort contacts by

reverse : bool, optional

Sort the contact pairs in descending order [default: False]

inplace : bool, optional

Replace the saved order of contacts [default: False]

Returns:

contact_map : ContactMap

The reference to the ContactMap, regardless of inplace

Raises:

ValueError

kword not in ContactMap
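The behaviour described by these parameters (sort by a named attribute, optionally in descending order, and fail for an unknown key) can be sketched with plain objects. The `Pair` class and `sort_contacts` function are hypothetical stand-ins, not ConKit classes.

```python
from operator import attrgetter


class Pair:
    """Hypothetical stand-in for a contact with a few attributes."""

    def __init__(self, res1_seq, res2_seq, raw_score):
        self.res1_seq = res1_seq
        self.res2_seq = res2_seq
        self.raw_score = raw_score


def sort_contacts(contacts, kword, reverse=False):
    """Return a new list sorted by the attribute named kword."""
    if not all(hasattr(c, kword) for c in contacts):
        raise ValueError("kword not in contacts: %s" % kword)
    return sorted(contacts, key=attrgetter(kword), reverse=reverse)


pairs = [Pair(1, 10, 0.333), Pair(5, 30, 0.667), Pair(2, 20, 0.5)]
by_score = sort_contacts(pairs, "raw_score", reverse=True)
top_scores = [p.raw_score for p in by_score]
```

Returning a new list mirrors the `inplace=False` default; an in-place variant would reorder the stored contacts instead.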

top_contact

The first Contact entry in ContactMap

Returns:

top_contact : Contact, None

The first Contact entry in ContactMap

class Sequence(id, seq)[source]

Bases: conkit.core._Entity

A sequence template to store all associated information

Examples

>>> from conkit.core import Sequence
>>> sequence_entry = Sequence("example", "ABCDEF")
>>> print(sequence_entry)
Sequence(id="example" seq="ABCDEF" seqlen=6)

Attributes

id The ID of the selected entity
remark The Sequence-specific remarks
seq The protein sequence as str
seq_len The protein sequence length

Methods

add(entity) Add a child to the Entity
align_global(other[, id_chars, nonid_chars, ...]) Generate a global alignment between two Sequence instances
align_local(other[, id_chars, nonid_chars, ...]) Generate a local alignment between two Sequence instances
copy() Create a shallow copy of Entity
deepcopy() Create a deep copy of Entity
remove(id) Remove a child

align_global(other, id_chars=2, nonid_chars=1, gap_open_pen=-0.5, gap_ext_pen=-0.1, inplace=False)[source]

Generate a global alignment between two Sequence instances

Parameters:

other : Sequence

id_chars : int, optional

nonid_chars : int, optional

gap_open_pen : float, optional

gap_ext_pen : float, optional

inplace : bool, optional

Replace the saved order of residues [default: False]

Returns:

Sequence

The reference to the Sequence, regardless of inplace

Sequence

The reference to the Sequence, regardless of inplace

align_local(other, id_chars=2, nonid_chars=1, gap_open_pen=-0.5, gap_ext_pen=-0.1, inplace=False)[source]

Generate a local alignment between two Sequence instances

Parameters:

other : Sequence

id_chars : int, optional

nonid_chars : int, optional

gap_open_pen : float, optional

gap_ext_pen : float, optional

inplace : bool, optional

Replace the saved order of residues [default: False]

Returns:

Sequence

The reference to the Sequence, regardless of inplace

Sequence

The reference to the Sequence, regardless of inplace

remark

The Sequence-specific remarks

seq

The protein sequence as str

seq_len

The protein sequence length

class SequenceFile(id)[source]

Bases: conkit.core._Entity

A sequence file object representing a single sequence file

The SequenceFile class represents a data structure to hold Sequence instances in a single sequence file. It contains functions to store and analyze sequences.

Examples

>>> from conkit.core import Sequence, SequenceFile
>>> sequence_file = SequenceFile("example")
>>> sequence_file.add(Sequence("foo", "ABCDEF"))
>>> sequence_file.add(Sequence("bar", "ZYXWVU"))
>>> print(sequence_file)
SequenceFile(id="example" nseqs=2)

Attributes

id The ID of the selected entity
is_alignment A boolean status for the alignment
nseqs The number of Sequence instances
remark The SequenceFile-specific remarks
status An indication of the residue status, i.e. true positive, false positive, or unknown
top_sequence The first Sequence entry in SequenceFile

Methods

add(entity) Add a child to the Entity
calculate_freq() Calculate the gap frequency in each alignment column
calculate_meff([identity]) Calculate the number of effective sequences
copy() Create a shallow copy of Entity
deepcopy() Create a deep copy of Entity
remove(id) Remove a child
sort(kword[, reverse, inplace]) Sort the SequenceFile
trim(start, end[, inplace]) Trim the SequenceFile

calculate_freq()[source]

Calculate the gap frequency in each alignment column

This function calculates the frequency of gaps at each position in the Multiple Sequence Alignment.
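A per-column gap fraction can be sketched on a list of equal-length strings as follows. The function name, the `-` gap symbol and the `ValueError` for unequal lengths are assumptions of this example; it is not ConKit's implementation, and the amino acid frequency of a column is simply one minus the value computed here.

```python
def gap_frequency(alignment, gap="-"):
    """Fraction of gap characters in each alignment column."""
    if len({len(seq) for seq in alignment}) != 1:
        # Assumption: an alignment must have sequences of equal length.
        raise ValueError("Sequences differ in length")
    nseqs = len(alignment)
    # zip(*alignment) iterates over the columns of the alignment.
    return [column.count(gap) / nseqs for column in zip(*alignment)]


freqs = gap_frequency(["A-C", "--C", "ABC"])
```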

Returns:

list

A list containing the per alignment-column amino acid frequency count

Raises:

MemoryError

Too many sequences in the alignment

RuntimeError

SequenceFile is not an alignment

calculate_meff(identity=0.7)[source]

Calculate the number of effective sequences

This function calculates the number of effective sequences (Meff) in the Multiple Sequence Alignment.

The mathematical function used to calculate Meff is

\[M_{eff}=\sum_{i}\frac{1}{\sum_{j}S_{i,j}}\]
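The formula can be illustrated with a brute-force sketch. It assumes that \(S_{i,j}\) is 1 when sequences \(i\) and \(j\) share at least the given fraction of identical positions and 0 otherwise, which is the usual convention but is not spelled out here. The function names are hypothetical and the quadratic loop is only suitable for small alignments.

```python
def pairwise_identity(a, b):
    """Fraction of aligned positions at which two sequences agree."""
    return sum(1 for p, q in zip(a, b) if p == q) / len(a)


def meff_sketch(alignment, identity=0.7):
    """Sum over sequences of 1 / (number of sequences similar to it)."""
    if not 0 <= identity <= 1:
        raise ValueError("Sequence identity needs to be between 0 and 1")
    meff = 0.0
    for seq_i in alignment:
        # Every sequence is similar to itself, so the count is at least 1.
        similar = sum(
            1 for seq_j in alignment
            if pairwise_identity(seq_i, seq_j) >= identity
        )
        meff += 1.0 / similar
    return meff


# Two identical sequences share one unit of weight; the third counts fully.
meff = meff_sketch(["AAAA", "AAAA", "CCCC"])
```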
Parameters:

identity : float, optional

The sequence identity to use for similarity decision [default: 0.7]

Returns:

int

The number of effective sequences

Raises:

MemoryError

Too many sequences in the alignment for Hamming distance calculation

RuntimeError

SciPy package not installed

ValueError

SequenceFile is not an alignment

ValueError

Sequence Identity needs to be between 0 and 1

is_alignment

A boolean status for the alignment

Returns:

bool

A boolean status for the alignment

nseqs

The number of Sequence instances in the SequenceFile

Returns:

int

The number of sequences in the SequenceFile

remark

The SequenceFile-specific remarks

sort(kword, reverse=False, inplace=False)[source]

Sort the SequenceFile

Parameters:

kword : str

The dictionary key to sort sequences by

reverse : bool, optional

Sort the sequences in reverse order [default: False]

inplace : bool, optional

Replace the saved order of sequences [default: False]

Returns:

SequenceFile

The reference to the SequenceFile, regardless of inplace

Raises:

ValueError

kword not in SequenceFile

status

An indication of the residue status, i.e. true positive, false positive, or unknown

top_sequence

The first Sequence entry in SequenceFile

Returns:

Sequence, None

The first Sequence entry in SequenceFile

trim(start, end, inplace=False)[source]

Trim the SequenceFile

Parameters:

start : int

First residue to include

end : int

Final residue to include

inplace : bool, optional

Replace the saved order of sequences [default: False]

Returns:

SequenceFile

The reference to the SequenceFile, regardless of inplace
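Trimming can be sketched as slicing every sequence to the same column window. The sketch assumes that `start` and `end` are 1-based and both inclusive, matching the wording "first residue to include" and "final residue to include"; the function name and the list-of-strings representation are hypothetical.

```python
def trim_alignment(alignment, start, end):
    """Keep columns start..end (1-based, inclusive) of every sequence."""
    # Assumption: 1-based inclusive positions, hence the start - 1 offset.
    return [seq[start - 1:end] for seq in alignment]


trimmed = trim_alignment(["ABCDEF", "ZYXWVU"], start=2, end=4)
```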