# conkit.core.sequencefile module

SequenceFile container used throughout ConKit

class SequenceAlignmentState

Bases: enum.Enum

Alignment states

aligned = 2
unaligned = 1
unknown = 0
class SequenceFile(id)

Bases: conkit.core._entity._Entity

A sequence file object representing a single sequence file

The SequenceFile class represents a data structure to hold Sequence instances in a single sequence file. It contains functions to store and analyze sequences.

Examples

>>> from conkit.core import Sequence, SequenceFile
>>> sequence_file = SequenceFile("example")
>>> sequence_file.add(Sequence("foo", "ABCDEF"))
>>> sequence_file.add(Sequence("bar", "ZYXWVU"))
>>> print(sequence_file)
SequenceFile(id="example" nseq=2)


Attributes

| Attribute | Description |
|---|---|
| id | The ID of the selected entity |
| is_alignment | A boolean status for the alignment |
| neff | The number of effective sequences |
| nseq | The number of sequences |
| remark | The SequenceFile-specific remarks |
| status | An indication of the residue status, i.e. true positive, false positive, or unknown |
| top_sequence | The first Sequence entry in SequenceFile |

Methods

| Method | Description |
|---|---|
| add(entity) | Add a child to the Entity |
| calculate_freq() | Calculate the gap frequency in each alignment column |
| calculate_meff([identity]) | Calculate the number of effective sequences |
| calculate_neff_with_identity(identity) | Calculate the number of effective sequences with specified sequence identity |
| calculate_weights([identity]) | Calculate the sequence weights |
| copy() | Create a shallow copy of Entity |
| deepcopy() | Create a deep copy of Entity |
| filter([min_id, max_id, inplace]) | Filter an alignment |
| remove(id) | Remove a child |
| sort(kword[, reverse, inplace]) | Sort the SequenceFile |
| trim(start, end[, inplace]) | Trim the SequenceFile |

ascii_matrix

The alignment encoded in a 2-D ASCII matrix
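
As a rough illustration of what a 2-D ASCII encoding of an alignment can look like, the sketch below turns each sequence into a row of character codes using `ord`. The helper name `to_ascii_matrix` and the example sequences are hypothetical; this is not taken from ConKit's implementation.

```python
def to_ascii_matrix(sequences):
    """Encode each sequence as a row of ASCII character codes."""
    return [[ord(char) for char in seq] for seq in sequences]


# Two aligned sequences of equal length; "-" marks a gap
alignment = ["AC-E", "A-DE"]
matrix = to_ascii_matrix(alignment)
print(matrix)  # [[65, 67, 45, 69], [65, 45, 68, 69]]
```

In such an encoding the gap character `-` becomes the code 45, which makes gaps easy to locate with numeric comparisons.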

calculate_freq()

Calculate the gap frequency in each alignment column

This function calculates the frequency of gaps at each position in the Multiple Sequence Alignment.

Returns:

- list: A list containing the per alignment-column amino acid frequency count

Raises:

- MemoryError: Too many sequences in the alignment
- RuntimeError: SequenceFile is not an alignment

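
A per-column gap frequency can be sketched in plain Python as follows. This is a minimal illustration that follows the description above (fraction of gap characters per column); the function name `gap_frequency` and the `gap` parameter are hypothetical, and the complement `1 - value` would give the non-gap fraction instead.

```python
def gap_frequency(sequences, gap="-"):
    """Return the fraction of gap characters in each alignment column."""
    # An alignment needs at least one sequence and equal sequence lengths
    if len({len(seq) for seq in sequences}) != 1:
        raise RuntimeError("sequences do not form an alignment")
    nseq = len(sequences)
    # zip(*sequences) iterates over the alignment column by column
    return [column.count(gap) / nseq for column in zip(*sequences)]


alignment = ["AC-E", "A-DE", "A--E", "ACDE"]
frequencies = gap_frequency(alignment)
print(frequencies)  # [0.0, 0.5, 0.5, 0.0]
```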
calculate_meff(identity=0.8)

Calculate the number of effective sequences

calculate_neff_with_identity(identity)

Calculate the number of effective sequences with specified sequence identity

calculate_weights(identity=0.8)

Calculate the sequence weights

This function calculates the sequence weights in the Multiple Sequence Alignment.

The mathematical function used to calculate Meff is

$M_{eff}=\sum_{i}\frac{1}{\sum_{j}S_{i,j}}$
Parameters:

- identity (float, optional): The sequence identity to use for similarity decision [default: 0.8]

Returns:

- list: A list of the sequence weights in the alignment

Raises:

- MemoryError: Too many sequences in the alignment for Hamming distance calculation
- RuntimeError: SciPy package not installed
- ValueError: SequenceFile is not an alignment
- ValueError: Sequence Identity needs to be between 0 and 1

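
The Meff formula can be traced by hand with a small pure-Python sketch. It assumes that $S_{i,j}$ is 1 when the fraction of identical positions between sequences $i$ and $j$ reaches the identity threshold and 0 otherwise, so that each weight is one term of the sum. The function name `sequence_weights` and the exact threshold comparison are assumptions, not ConKit's SciPy-based implementation.

```python
def sequence_weights(sequences, identity=0.8):
    """Weight each sequence by 1 / (number of sequences similar to it)."""
    if not 0 <= identity <= 1:
        raise ValueError("Sequence identity needs to be between 0 and 1")
    lengths = {len(seq) for seq in sequences}
    if len(lengths) != 1 or 0 in lengths:
        raise ValueError("sequences do not form an alignment")
    length = len(sequences[0])
    weights = []
    for seq_i in sequences:
        similar = 0
        for seq_j in sequences:
            matches = sum(a == b for a, b in zip(seq_i, seq_j))
            # Assumption: S_ij = 1 when the fraction of identical positions
            # reaches the threshold; a sequence always matches itself
            if matches / length >= identity:
                similar += 1
        weights.append(1 / similar)
    return weights


alignment = ["AAAAA", "AAAAC", "CCCCC"]
weights = sequence_weights(alignment, identity=0.8)
meff = sum(weights)  # Meff is the sum of the per-sequence weights
print(weights)  # [0.5, 0.5, 1.0]
print(meff)     # 2.0
```

The first two sequences are 80% identical and therefore share their weight, while the third has no similar partner and counts fully.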
empty

Whether the SequenceFile is empty

filter(min_id=0.3, max_id=0.9, inplace=False)

Filter an alignment

Parameters:

- min_id (float, optional): Minimum sequence identity [default: 0.3]
- max_id (float, optional): Maximum sequence identity [default: 0.9]
- inplace (bool, optional): Replace the saved order of sequences [default: False]

Returns:

- obj: The reference to the SequenceFile, regardless of inplace

Raises:

- MemoryError: Too many sequences in the alignment for Hamming distance calculation
- RuntimeError: SciPy package not installed
- ValueError: SequenceFile is not an alignment
- ValueError: Minimum sequence Identity needs to be between 0 and 1
- ValueError: Maximum sequence Identity needs to be between 0 and 1

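
The exact filtering rule is not spelled out here. One plausible reading, sketched below purely as an assumption, keeps the first sequence and retains only those sequences whose identity to it lies within `[min_id, max_id]`. ConKit's own rule may differ, for example by comparing sequences pairwise; the names `pairwise_identity` and `filter_by_identity` are hypothetical.

```python
def pairwise_identity(seq_a, seq_b):
    """Fraction of aligned positions at which two sequences agree."""
    return sum(a == b for a, b in zip(seq_a, seq_b)) / len(seq_a)


def filter_by_identity(sequences, min_id=0.3, max_id=0.9):
    """Keep the first sequence plus those within [min_id, max_id] of it."""
    for name, value in (("Minimum", min_id), ("Maximum", max_id)):
        if not 0 <= value <= 1:
            raise ValueError(f"{name} sequence identity needs to be between 0 and 1")
    # Assumption: the first sequence acts as the reference
    reference = sequences[0]
    kept = [reference]
    for seq in sequences[1:]:
        if min_id <= pairwise_identity(reference, seq) <= max_id:
            kept.append(seq)
    return kept


alignment = ["AAAAAAAAAA", "AAAAAAAAAA", "AAAAACCCCC", "CCCCCCCCCC"]
filtered = filter_by_identity(alignment)
print(filtered)  # ['AAAAAAAAAA', 'AAAAACCCCC']
```

Under this reading the exact duplicate (identity 1.0) and the unrelated sequence (identity 0.0) are both removed, while the sequence at 0.5 identity is kept.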
is_alignment

A boolean status for the alignment

Returns:

- bool: A boolean status for the alignment

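
How an alignment status might be derived can be sketched with a small enum mirroring the three states documented for SequenceAlignmentState. The sketch assumes that sequences count as aligned when they all share one length and that an empty collection is reported as unknown; both rules, and the names `AlignmentState` and `alignment_state`, are assumptions rather than ConKit's documented behavior.

```python
from enum import Enum


class AlignmentState(Enum):
    """Hypothetical stand-in mirroring the three documented states."""
    unknown = 0
    unaligned = 1
    aligned = 2


def alignment_state(sequences):
    """Classify sequences as aligned when they all share one length."""
    if not sequences:
        # Assumption: nothing can be said about an empty collection
        return AlignmentState.unknown
    lengths = {len(seq) for seq in sequences}
    return AlignmentState.aligned if len(lengths) == 1 else AlignmentState.unaligned


print(alignment_state(["AC-E", "A-DE"]).name)  # aligned
print(alignment_state(["ACE", "A-DE"]).name)   # unaligned
```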
neff

The number of effective sequences

nseq

The number of sequences

remark

The SequenceFile-specific remarks

sort(kword, reverse=False, inplace=False)

Sort the SequenceFile

Parameters:

- kword (str): The dictionary key to sort sequences by
- reverse (bool, optional): Sort the sequences in reverse order [default: False]
- inplace (bool, optional): Replace the saved order of sequences [default: False]

Returns:

- obj: The reference to the SequenceFile, regardless of inplace

Raises:

- ValueError: kword not in SequenceFile

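
The interplay of `kword`, `reverse` and `inplace` can be illustrated with a self-contained sketch. The `Record` and `RecordFile` classes are hypothetical stand-ins, not ConKit classes; the sketch only shows the pattern of sorting by a named attribute and returning a reference whether or not the original is modified.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Record:
    """Hypothetical stand-in for a sequence entry."""
    id: str
    seq: str


@dataclass
class RecordFile:
    """Hypothetical stand-in for a container of sequence entries."""
    records: list = field(default_factory=list)

    def sort(self, kword, reverse=False, inplace=False):
        # Reject keywords that the entries do not define
        if not all(hasattr(record, kword) for record in self.records):
            raise ValueError(f"{kword} not in RecordFile")
        # Work on a copy unless the caller asked to modify this object
        target = self if inplace else copy.deepcopy(self)
        target.records.sort(key=lambda record: getattr(record, kword), reverse=reverse)
        # A reference is returned in both cases
        return target


original = RecordFile([Record("b", "CCC"), Record("a", "AAA")])
sorted_copy = original.sort("id")
print([r.id for r in sorted_copy.records])  # ['a', 'b']
print([r.id for r in original.records])     # ['b', 'a']
```

With `inplace=True` the same call reorders `original` itself and returns that same object.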
status

An indication of the residue status, i.e. true positive, false positive, or unknown

top_sequence

The first Sequence entry in SequenceFile

Returns:

- obj: The first Sequence entry in SequenceFile

trim(start, end, inplace=False)

Trim the SequenceFile

Parameters:

- start (int): First residue to include
- end (int): Final residue to include
- inplace (bool, optional): Replace the saved order of sequences [default: False]

Returns:

- obj: The reference to the SequenceFile, regardless of inplace
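
Because both `start` and `end` are described as residues to include, trimming can be read as an inclusive range. The sketch below assumes 1-based residue numbering, which is not stated in the text above; the function name `trim_sequences` is hypothetical.

```python
def trim_sequences(sequences, start, end):
    """Keep residues start..end of every sequence (1-based, inclusive)."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    # Convert the 1-based inclusive range to a Python slice
    return [seq[start - 1:end] for seq in sequences]


trimmed = trim_sequences(["ABCDEF", "ZYXWVU"], start=2, end=4)
print(trimmed)  # ['BCD', 'YXW']
```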