conkit.core.sequencefile module¶

SequenceFile container used throughout ConKit

class SequenceFile(id)[source]¶

Bases: conkit.core.entity.Entity

A sequence file object representing a single sequence file

The SequenceFile class represents a data structure to hold Sequence instances in a single sequence file. It contains functions to store and analyze sequences.

id¶

A unique identifier

Type:	str

is_alignment¶

A boolean status for the alignment

Type:	bool

meff¶

The number of effective sequences in the SequenceFile

Type:	int

nseq¶

The number of sequences in the SequenceFile

Type:	int

remark¶

The SequenceFile-specific remarks

Type:	list

status¶

An indication of the sequence file status, i.e. alignment, no alignment, or unknown

Type:	int

top_sequence¶

The first Sequence entry in the file

Type:	`Sequence`, None

Examples

>>> from conkit.core import Sequence, SequenceFile
>>> sequence_file = SequenceFile("example")
>>> sequence_file.add(Sequence("foo", "ABCDEF"))
>>> sequence_file.add(Sequence("bar", "ZYXWVU"))
>>> print(sequence_file)
SequenceFile(id="example" nseq=2)

ascii_matrix¶: The alignment encoded in a 2-D ASCII matrix

diversity¶

The diversity of an alignment defined by \(\sqrt{N}/L\).

N equals the number of sequences in the alignment and L the sequence length
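As a worked illustration of the formula above, the following minimal sketch computes \(\sqrt{N}/L\) in plain Python. The helper name `diversity` and its arguments are hypothetical and stand in for the property; this is not ConKit's own code.

```python
import math


def diversity(n_sequences, seq_length):
    """Return sqrt(N) / L for an alignment of N sequences of length L."""
    if seq_length <= 0:
        raise ValueError("Sequence length must be positive")
    return math.sqrt(n_sequences) / seq_length


# 100 aligned sequences, each 50 columns long: sqrt(100) / 50 = 0.2
example = diversity(100, 50)
print(example)
```

Because N enters under a square root, quadrupling the number of sequences only doubles the diversity, whereas doubling the sequence length halves it.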

empty¶: Whether the SequenceFile is empty

encoded_matrix¶: The alignment encoded for contact prediction

filter(min_id=0.3, max_id=0.9, inplace=False)[source]¶

Filter sequences from an alignment according to the minimum and maximum identity between the sequences

Parameters:
	min_id (float, optional) – Minimum sequence identity
	max_id (float, optional) – Maximum sequence identity
	inplace (bool, optional) – Replace the saved order of sequences [default: False]
Returns:	The reference to the `SequenceFile`, regardless of inplace
Return type:	`SequenceFile`
Raises:
	`ValueError` – `SequenceFile` is not an alignment
	`ValueError` – Minimum sequence identity needs to be between 0 and 1
	`ValueError` – Maximum sequence identity needs to be between 0 and 1
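The idea of identity-based filtering can be sketched on plain strings. The example below is an assumption-laden illustration rather than ConKit's implementation: it measures identity as the fraction of matching columns and keeps a sequence only if its identity to every sequence kept so far lies within `[min_id, max_id]`. The helper names `pairwise_identity` and `filter_by_identity` are hypothetical.

```python
def pairwise_identity(a, b):
    """Fraction of aligned columns in which two equal-length sequences match."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def filter_by_identity(seqs, min_id=0.3, max_id=0.9):
    """Keep sequences whose identity to all kept sequences is within bounds."""
    for bound in (min_id, max_id):
        if not 0 <= bound <= 1:
            raise ValueError("Sequence identity needs to be between 0 and 1")
    kept = []
    for seq in seqs:
        # Assumption: both bounds are inclusive and compared against kept sequences only
        if all(min_id <= pairwise_identity(seq, k) <= max_id for k in kept):
            kept.append(seq)
    return kept


alignment = ["AAAAAAAAAA", "AAAAAAAAAA", "AAAAACCCCC", "CCCCCCCCCC"]
filtered = filter_by_identity(alignment)
print(filtered)
```

In this toy alignment the second sequence is dropped as a duplicate of the first (identity 1.0), and the last is dropped because it shares no columns with the first (identity 0.0).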

filter_gapped(min_prop=0.0, max_prop=0.9, inplace=True)[source]¶

Filter out all sequences whose gap proportion falls outside the allowed range

Parameters:
	min_prop (float, optional) – Minimum allowed gap proportion [default: 0.0]
	max_prop (float, optional) – Maximum allowed gap proportion [default: 0.9]
	inplace (bool, optional) – Replace the saved order of sequences [default: True]
Returns:	The reference to the `SequenceFile`, regardless of inplace
Return type:	`SequenceFile`
Raises:
	`ValueError` – `SequenceFile` is not an alignment
	`ValueError` – Minimum gap proportion needs to be between 0 and 1
	`ValueError` – Maximum gap proportion needs to be between 0 and 1
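A minimal sketch of gap-based filtering on plain strings follows. It assumes that `-` is the gap symbol and that both bounds are inclusive; neither is stated above, and the helper names `gap_proportion` and `drop_gapped` are hypothetical.

```python
def gap_proportion(seq, gap="-"):
    """Fraction of positions in a sequence that are gap characters."""
    return seq.count(gap) / len(seq)


def drop_gapped(seqs, min_prop=0.0, max_prop=0.9):
    """Keep only sequences whose gap proportion lies within [min_prop, max_prop]."""
    for bound in (min_prop, max_prop):
        if not 0 <= bound <= 1:
            raise ValueError("Gap proportion needs to be between 0 and 1")
    return [s for s in seqs if min_prop <= gap_proportion(s) <= max_prop]


alignment = ["ACDEFGHIKL", "A---------", "AC--FG--KL"]
kept = drop_gapped(alignment, max_prop=0.5)
print(kept)
```

Here the second sequence has a gap proportion of 0.9 and is removed, while the third (0.4) stays under the 0.5 limit.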

get_frequency(symbol)[source]¶

Calculate the frequency of an amino acid (symbol) in each Multiple Sequence Alignment column

Returns:	A list containing the per alignment-column amino acid frequency count
Return type:	list
Raises:	`RuntimeError` – `SequenceFile` is not an alignment
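The per-column count can be illustrated on a list of equal-length strings. This is a sketch of the idea only; the helper name `column_counts` is hypothetical and the alignment is represented as plain strings rather than `Sequence` objects.

```python
def column_counts(seqs, symbol):
    """Count how often `symbol` occurs in each column of an alignment."""
    # zip(*seqs) transposes the alignment so each tuple is one column
    return [column.count(symbol) for column in zip(*seqs)]


alignment = ["AAC", "A-C", "ACC"]
counts = column_counts(alignment, "A")
print(counts)
```

For this three-column alignment, `A` appears three times in the first column, once in the second, and not at all in the third.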

get_meff_with_id(identity)[source]¶: Calculate the number of effective sequences with the specified sequence identity

See also

meff(), get_weights()

get_weights(identity=0.8)[source]¶

Calculate the sequence weights

This function calculates the sequence weights in the Multiple Sequence Alignment.

The mathematical function used to calculate Meff is

\[M_{eff}=\sum_{i}\frac{1}{\sum_{j}S_{i,j}}\]

Parameters:	identity (float, optional) – The sequence identity to use for similarity decision [default: 0.8]
Returns:	A list of the sequence weights in the alignment
Return type:	list
Raises:
	`ValueError` – `SequenceFile` is not an alignment
	`ValueError` – Sequence Identity needs to be between 0 and 1
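The relationship between the weights and \(M_{eff}\) can be sketched as follows. The definition of \(S_{i,j}\) is not given above, so this example assumes \(S_{i,j}=1\) when sequences i and j share at least `identity` of their columns and 0 otherwise. The helper names are hypothetical and this is not ConKit's implementation.

```python
def pairwise_identity(a, b):
    """Fraction of aligned columns in which two equal-length sequences match."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def sequence_weights(seqs, identity=0.8):
    """Weight each sequence by 1 / (number of sequences similar to it)."""
    if not 0 <= identity <= 1:
        raise ValueError("Sequence Identity needs to be between 0 and 1")
    weights = []
    for a in seqs:
        # Assumed S_ij: 1 if identity(a, b) >= threshold; includes a itself
        n_similar = sum(1 for b in seqs if pairwise_identity(a, b) >= identity)
        weights.append(1.0 / n_similar)
    return weights


alignment = ["AAAAA", "AAAAA", "CCCCC"]
weights = sequence_weights(alignment)
m_eff = sum(weights)
print(weights, m_eff)
```

The two identical sequences share one unit of weight between them, so three sequences contribute an effective count of 2.0 under these assumptions.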

is_alignment

A boolean status for the alignment

Returns:	A boolean status for the alignment
Return type:	bool

meff: The number of effective sequences

nseq: The number of sequences

remark: The SequenceFile-specific remarks

sort(kword, reverse=False, inplace=False)[source]¶

Sort the SequenceFile

Parameters:
	kword (str) – The dictionary key to sort sequences by
	reverse (bool, optional) – Sort the sequences in reverse order [default: False]
	inplace (bool, optional) – Replace the saved order of sequences [default: False]
Returns:	The reference to the `SequenceFile`, regardless of inplace
Return type:	`SequenceFile`
Raises:	`ValueError` – `kword` not in `SequenceFile`
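Sorting by a keyword can be sketched with the standard library. The example assumes that `kword` names an attribute present on every entry; the `Record` type and the `sort_records` helper are hypothetical stand-ins for `Sequence` and `SequenceFile.sort`.

```python
from collections import namedtuple
from operator import attrgetter

# Hypothetical stand-in for a Sequence entry
Record = namedtuple("Record", ["id", "seq"])


def sort_records(records, kword, reverse=False):
    """Return a new list of records sorted by the attribute named `kword`."""
    if any(not hasattr(record, kword) for record in records):
        raise ValueError("{} not found in records".format(kword))
    return sorted(records, key=attrgetter(kword), reverse=reverse)


records = [Record("foo", "ABCDEF"), Record("bar", "ZYXWVU")]
by_id = sort_records(records, "id")
by_seq_desc = sort_records(records, "seq", reverse=True)
print([r.id for r in by_id], [r.id for r in by_seq_desc])
```

Returning a new list mirrors the `inplace=False` behaviour described above, where the original order is left untouched.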

status: An indication of the sequence file status, i.e. alignment, no alignment, or unknown

summary()[source]¶

Generate a summary for the SequenceFile

Returns:	A summary of the `SequenceFile`
Return type:	str

to_string()[source]¶: Return the SequenceFile as str

top_sequence

The first Sequence entry in SequenceFile

Returns:	The first `Sequence` entry in `SequenceFile`
Return type:	`Sequence`

trim(start, end, inplace=False)[source]¶

Trim the SequenceFile

Parameters:
	start (int) – First residue to include
	end (int) – Final residue to include
	inplace (bool, optional) – Replace the saved order of sequences [default: False]
Returns:	The reference to the `SequenceFile`, regardless of inplace
Return type:	`SequenceFile`