conkit.core.sequencefile module

SequenceFile container used throughout ConKit

class SequenceFile(id)[source]

Bases: conkit.core._entity._Entity

A sequence file object representing a single sequence file

The SequenceFile class represents a data structure to hold Sequence instances in a single sequence file. It contains functions to store and analyze sequences.

Examples

>>> from conkit.core import Sequence, SequenceFile
>>> sequence_file = SequenceFile("example")
>>> sequence_file.add(Sequence("foo", "ABCDEF"))
>>> sequence_file.add(Sequence("bar", "ZYXWVU"))
>>> print(sequence_file)
SequenceFile(id="example" nseq=2)

Attributes

id The ID of the selected entity
is_alignment A boolean status for the alignment
meff The number of effective sequences
nseq The number of sequences
remark The SequenceFile-specific remarks
status An indication of the residue status, i.e. true positive, false positive, or unknown
top_sequence The first Sequence entry in SequenceFile

Methods

add(entity) Add a child to the Entity
calculate_freq() Calculate the gap frequency in each alignment column
calculate_meff([identity]) Calculate the number of effective sequences
calculate_meff_with_identity(identity) Calculate the number of effective sequences with specified sequence identity
calculate_neff_with_identity(identity) Calculate the number of effective sequences with specified sequence identity
calculate_weights([identity]) Calculate the sequence weights
copy() Create a shallow copy of Entity
deepcopy() Create a deep copy of Entity
filter([min_id, max_id, inplace]) Filter sequences from an alignment according to the minimum and maximum identity
filter_gapped([min_prop, max_prop, inplace]) Filter all sequences with a gap proportion greater than the limit
remove(id) Remove a child
sort(kword[, reverse, inplace]) Sort the SequenceFile
to_string() Return the SequenceFile as str
trim(start, end[, inplace]) Trim the SequenceFile
ascii_matrix

The alignment encoded in a 2-D ASCII matrix

calculate_freq()[source]

Calculate the gap frequency in each alignment column

This function calculates the frequency of gaps at each position in the Multiple Sequence Alignment.

Returns:

list

A list containing the per alignment-column amino acid frequency count

Raises:

MemoryError

Too many sequences in the alignment

RuntimeError

SequenceFile is not an alignment
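
The per-column gap frequency described above can be illustrated with a minimal pure-Python sketch. It does not use ConKit; the function name `gap_frequency_per_column` and the gap symbol `"-"` are assumptions made for illustration, and the exact values ConKit returns may be scaled or encoded differently.

```python
def gap_frequency_per_column(sequences, gap="-"):
    """Return the fraction of gap characters in each alignment column."""
    if not sequences:
        return []
    length = len(sequences[0])
    # An alignment requires every sequence to have the same length
    if any(len(seq) != length for seq in sequences):
        raise RuntimeError("sequences are not an alignment")
    nseq = len(sequences)
    return [
        sum(seq[col] == gap for seq in sequences) / nseq
        for col in range(length)
    ]


# Hypothetical four-sequence alignment
alignment = ["AC-E", "A--E", "ACDE", "-CDE"]
frequencies = gap_frequency_per_column(alignment)
print(frequencies)
```

Each entry is the number of gaps in that column divided by the number of sequences.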

calculate_meff(identity=0.8)[source]

Calculate the number of effective sequences

See also

meff

calculate_meff_with_identity(identity)[source]

Calculate the number of effective sequences with specified sequence identity

calculate_neff_with_identity(identity)[source]

Calculate the number of effective sequences with specified sequence identity

calculate_weights(identity=0.8)[source]

Calculate the sequence weights

This function calculates the sequence weights in the Multiple Sequence Alignment.

The mathematical function used to calculate Meff is

\[M_{eff}=\sum_{i}\frac{1}{\sum_{j}S_{i,j}}\]

Parameters:

identity : float, optional

The sequence identity to use for similarity decision [default: 0.8]

Returns:

list

A list of the sequence weights in the alignment

Raises:

ImportError

Cannot find SciPy package

ValueError

SequenceFile is not an alignment

ValueError

Sequence Identity needs to be between 0 and 1
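
The Meff formula can be worked through on a toy alignment. The sketch below is not ConKit's implementation. It assumes that S_{i,j} is 1 when the fraction of identical positions between sequences i and j is at least `identity` and 0 otherwise, and that each sequence counts as similar to itself. The function name `sequence_weights` is hypothetical.

```python
def sequence_weights(sequences, identity=0.8):
    """Return one weight per sequence, 1 / sum_j S_ij."""
    if not 0 <= identity <= 1:
        raise ValueError("Sequence identity needs to be between 0 and 1")
    if not sequences:
        return []
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("sequences are not an alignment")

    def pairwise_identity(a, b):
        # Fraction of columns in which both sequences carry the same symbol
        return sum(x == y for x, y in zip(a, b)) / length

    weights = []
    for seq_i in sequences:
        # sum_j S_ij; the self-comparison always counts as similar (assumption)
        neighbours = sum(
            pairwise_identity(seq_i, seq_j) >= identity for seq_j in sequences
        )
        weights.append(1.0 / neighbours)
    return weights


# Hypothetical alignment: the first two sequences share 4 of 5 positions
alignment = ["AAAAA", "AAAAC", "CCCCC"]
weights = sequence_weights(alignment, identity=0.8)
meff = sum(weights)
print(weights, meff)
```

Under these assumptions the two similar sequences share their weight (0.5 each), the third keeps a weight of 1.0, and Meff is the sum of the weights.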

diversity

The diversity of an alignment defined by \(\sqrt{N}/L\).

N equals the number of sequences in the alignment and L the sequence length
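
As a quick numeric illustration of the definition, the following sketch evaluates \(\sqrt{N}/L\) directly. The helper name `diversity` and the input values are chosen for illustration only.

```python
import math


def diversity(nseq, length):
    """Alignment diversity, sqrt(N) / L."""
    return math.sqrt(nseq) / length


# Hypothetical alignment with N = 100 sequences of length L = 50
value = diversity(100, 50)
print(value)
```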

empty

Whether the SequenceFile is empty

encoded_matrix

The alignment encoded for contact prediction

filter(min_id=0.3, max_id=0.9, inplace=False)[source]

Filter sequences from an alignment according to the minimum and maximum identity between the sequences

Parameters:

min_id : float, optional

Minimum sequence identity

max_id : float, optional

Maximum sequence identity

inplace : bool, optional

Replace the saved order of sequences [default: False]

Returns:

obj

The reference to the SequenceFile, regardless of inplace

Raises:

MemoryError

Too many sequences in the alignment for Hamming distance calculation

RuntimeError

SciPy package not installed

ValueError

SequenceFile is not an alignment

ValueError

Minimum sequence identity needs to be between 0 and 1

ValueError

Maximum sequence identity needs to be between 0 and 1

filter_gapped(min_prop=0.0, max_prop=0.9, inplace=True)[source]

Filter all sequences with a gap proportion greater than the limit

Parameters:

min_prop : float, optional

Minimum allowed gap proportion [default: 0.0]

max_prop : float, optional

Maximum allowed gap proportion [default: 0.9]

inplace : bool, optional

Replace the saved order of sequences [default: True]

Returns:

obj

The reference to the SequenceFile, regardless of inplace
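
The gap-proportion filter can be sketched on plain strings. This is an illustration rather than ConKit's code. It assumes `"-"` is the gap symbol and that sequences with a gap proportion inside the closed interval [min_prop, max_prop] are kept. The function name `filter_by_gap_proportion` is hypothetical.

```python
def filter_by_gap_proportion(sequences, min_prop=0.0, max_prop=0.9, gap="-"):
    """Keep sequences whose gap proportion lies within [min_prop, max_prop]."""
    kept = []
    for seq in sequences:
        # Proportion of gap characters in this sequence
        proportion = seq.count(gap) / len(seq)
        if min_prop <= proportion <= max_prop:
            kept.append(seq)
    return kept


# Hypothetical alignment with gap proportions 0.0, 0.9, 1.0 and 0.4
alignment = ["ACDEFGHIKL", "A---------", "----------", "AC--FG--KL"]
kept = filter_by_gap_proportion(alignment, max_prop=0.5)
print(kept)
```

With `max_prop=0.5`, only the sequences with gap proportions 0.0 and 0.4 remain.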

is_alignment

A boolean status for the alignment

Returns:

bool

A boolean status for the alignment

meff

The number of effective sequences

neff

The number of effective sequences

nseq

The number of sequences

remark

The SequenceFile-specific remarks

sort(kword, reverse=False, inplace=False)[source]

Sort the SequenceFile

Parameters:

kword : str

The dictionary key to sort sequences by

reverse : bool, optional

Sort the sequences in reverse order [default: False]

inplace : bool, optional

Replace the saved order of sequences [default: False]

Returns:

obj

The reference to the SequenceFile, regardless of inplace

Raises:

ValueError

kword not in SequenceFile
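
The interplay of `kword`, `reverse` and `inplace` can be sketched with a small stand-in container. The classes `Record` and `Container` are hypothetical and are not part of ConKit. The sketch assumes that `inplace=False` sorts and returns a deep copy while `inplace=True` sorts and returns the original object, which is one way to return a reference "regardless of inplace".

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Record:
    id: str
    seq: str


@dataclass
class Container:
    records: list = field(default_factory=list)

    def sort(self, kword, reverse=False, inplace=False):
        # Work on the object itself or on a deep copy (assumed semantics)
        target = self if inplace else copy.deepcopy(self)
        try:
            target.records.sort(key=lambda r: getattr(r, kword), reverse=reverse)
        except AttributeError:
            raise ValueError(f"{kword} not in Container") from None
        return target


container = Container([Record("foo", "ABCDEF"), Record("bar", "ZYXWVU")])
ordered = container.sort("id")
print([r.id for r in ordered.records])
print([r.id for r in container.records])
```

In this sketch the original container keeps its order unless `inplace=True` is passed.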

status

An indication of the residue status, i.e. true positive, false positive, or unknown

to_string()[source]

Return the SequenceFile as str

top_sequence

The first Sequence entry in SequenceFile

Returns:

obj

The first Sequence entry in SequenceFile

trim(start, end, inplace=False)[source]

Trim the SequenceFile

Parameters:

start : int

First residue to include

end : int

Final residue to include

inplace : bool, optional

Replace the saved order of sequences [default: False]

Returns:

obj

The reference to the SequenceFile, regardless of inplace
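
Trimming amounts to slicing every sequence to the same column range. The sketch below works on plain strings and assumes that `start` and `end` are 1-based and inclusive, as suggested by "first residue to include" and "final residue to include". The numbering convention is an assumption, and the function name `trim_alignment` is hypothetical.

```python
def trim_alignment(sequences, start, end):
    """Keep columns start..end (1-based, inclusive) of every sequence."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    # Convert the 1-based inclusive range to a Python slice
    return [seq[start - 1:end] for seq in sequences]


trimmed = trim_alignment(["ABCDEF", "ZYXWVU"], 2, 4)
print(trimmed)
```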