conkit.core.sequencefile module

SequenceFile container used throughout ConKit

class SequenceFile(id)[source]

Bases: conkit.core.entity.Entity

A sequence file object representing a single sequence file

The SequenceFile class represents a data structure to hold Sequence instances in a single sequence file. It contains functions to store and analyze sequences.

id

A unique identifier

Type: str
is_alignment

A boolean status for the alignment

Type: bool
meff

The number of effective sequences in the SequenceFile

Type: int
nseq

The number of sequences in the SequenceFile

Type: int
remark

The SequenceFile-specific remarks

Type: list
status

An indication of the sequence file status, i.e. alignment, no alignment, or unknown

Type: int
top_sequence

The first Sequence entry in the file

Type: Sequence, None

Examples

>>> from conkit.core import Sequence, SequenceFile
>>> sequence_file = SequenceFile("example")
>>> sequence_file.add(Sequence("foo", "ABCDEF"))
>>> sequence_file.add(Sequence("bar", "ZYXWVU"))
>>> print(sequence_file)
SequenceFile(id="example" nseq=2)
ascii_matrix

The alignment encoded in a 2-D ASCII matrix
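
As a rough illustration of what a 2-D ASCII matrix can look like, the sketch below maps every character of every aligned sequence to its ASCII code point. The helper name to_ascii_matrix is hypothetical, and the use of plain nested lists is an assumption; the container and dtype ConKit actually returns are not stated here.

```python
def to_ascii_matrix(sequences):
    """Encode equal-length sequences as rows of ASCII code points.

    Hypothetical helper for illustration only, not part of ConKit.
    """
    if len({len(seq) for seq in sequences}) > 1:
        raise ValueError("sequences must all have the same length")
    return [[ord(char) for char in seq] for seq in sequences]


matrix = to_ascii_matrix(["AC-", "A-C"])
print(matrix)  # [[65, 67, 45], [65, 45, 67]]
```

Each row corresponds to one sequence and each column to one alignment position.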

diversity

The diversity of an alignment defined by \(\sqrt{N}/L\).

N equals the number of sequences in the alignment and L the sequence length
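
A worked example may help with the definition above. The function name alignment_diversity is hypothetical; it only evaluates \(\sqrt{N}/L\) for given values of N and L and does not touch ConKit.

```python
import math


def alignment_diversity(n_sequences, sequence_length):
    """Return sqrt(N) / L for N sequences of length L (illustrative helper)."""
    if sequence_length <= 0:
        raise ValueError("sequence_length must be positive")
    return math.sqrt(n_sequences) / sequence_length


# 400 sequences of length 100: sqrt(400) / 100 = 20 / 100
print(alignment_diversity(400, 100))  # 0.2
```

Because N enters through its square root, quadrupling the number of sequences only doubles the diversity, whereas doubling the sequence length halves it.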

empty

Status of emptiness of sequencefile

encoded_matrix

The alignment encoded for contact prediction

filter(min_id=0.3, max_id=0.9, inplace=False)[source]

Filter sequences from an alignment according to the minimum and maximum identity between the sequences

Parameters:
  • min_id (float, optional) – Minimum sequence identity
  • max_id (float, optional) – Maximum sequence identity
  • inplace (bool, optional) – Replace the saved order of sequences [default: False]
Returns:

The reference to the SequenceFile, regardless of inplace

Return type:

SequenceFile

Raises:
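
One way to picture identity-based filtering is sketched below in plain Python. This is not ConKit's implementation: the helper names sequence_identity and filter_by_identity are hypothetical, identity is assumed to be the fraction of matching positions, and the greedy "compare against sequences already kept" strategy is an assumption made for illustration.

```python
def sequence_identity(a, b):
    """Fraction of aligned positions at which two sequences agree."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def filter_by_identity(sequences, min_id=0.3, max_id=0.9):
    """Keep a sequence only if its identity to every kept sequence
    lies within [min_id, max_id] (assumed greedy strategy)."""
    kept = []
    for seq in sequences:
        if all(min_id <= sequence_identity(seq, k) <= max_id for k in kept):
            kept.append(seq)
    return kept


alignment = ["AAAAAAAAAA", "AAAAAAAAAA", "AAAAACCCCC", "CCCCCCCCCC"]
print(filter_by_identity(alignment))  # ['AAAAAAAAAA', 'AAAAACCCCC']
```

In this toy alignment the second sequence is dropped as a duplicate (identity 1.0) and the last as too distant (identity 0.0 to the first sequence).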
filter_gapped(min_prop=0.0, max_prop=0.9, inplace=True)[source]

Filter all sequences with a gap proportion greater than the limit

Parameters:
  • min_prop (float, optional) – Minimum allowed gap proportion [default: 0.0]
  • max_prop (float, optional) – Maximum allowed gap proportion [default: 0.9]
  • inplace (bool, optional) – Replace the saved order of sequences [default: True]
Returns:

The reference to the SequenceFile, regardless of inplace

Return type:

SequenceFile

Raises:
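
The gap-proportion criterion can be sketched without ConKit as follows. The helper names gap_proportion and drop_gapped are hypothetical, and treating "-" as the gap symbol is an assumption.

```python
def gap_proportion(seq, gap="-"):
    """Fraction of positions in seq that are gap characters."""
    if not seq:
        raise ValueError("sequence must not be empty")
    return seq.count(gap) / len(seq)


def drop_gapped(sequences, min_prop=0.0, max_prop=0.9):
    """Keep sequences whose gap proportion lies in [min_prop, max_prop]."""
    return [s for s in sequences if min_prop <= gap_proportion(s) <= max_prop]


alignment = ["ACDE", "A---", "----"]
print(drop_gapped(alignment))                # ['ACDE', 'A---']
print(drop_gapped(alignment, max_prop=0.5))  # ['ACDE']
```

With the default upper limit of 0.9 only the all-gap sequence is removed; lowering the limit to 0.5 also removes the sequence that is 75% gaps.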
get_frequency(symbol)[source]

Calculate the frequency of an amino acid (symbol) in each Multiple Sequence Alignment column

Returns: A list containing the per alignment-column amino acid frequency count
Return type: list
Raises: RuntimeError – SequenceFile is not an alignment
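
The per-column count described above can be illustrated with a small stand-alone function. The name column_frequency is hypothetical, and the alignment is represented as a plain list of equal-length strings rather than a SequenceFile.

```python
def column_frequency(alignment, symbol):
    """Count how often symbol occurs in each alignment column."""
    if len({len(seq) for seq in alignment}) != 1:
        # Mirrors the documented behaviour of refusing non-alignments
        raise RuntimeError("sequences do not form an alignment")
    # zip(*alignment) yields one tuple of characters per column
    return [column.count(symbol) for column in zip(*alignment)]


alignment = ["AC-", "A-C", "-CC"]
print(column_frequency(alignment, "A"))  # [2, 0, 0]
print(column_frequency(alignment, "C"))  # [0, 2, 2]
```

The returned list has one entry per alignment column, so its length equals the sequence length.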
get_meff_with_id(identity)[source]

Calculate the number of effective sequences with a specified sequence identity

See also

meff(), get_weights()

get_weights(identity=0.8)[source]

Calculate the sequence weights

This function calculates the sequence weights in the Multiple Sequence Alignment.

The mathematical function used to calculate Meff is

\[M_{eff}=\sum_{i}\frac{1}{\sum_{j}S_{i,j}}\]
Parameters:

identity (float, optional) – The sequence identity to use for similarity decision [default: 0.8]

Returns:

A list of the sequence weights in the alignment

Return type:

list

Raises:
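
The relationship between the per-sequence weights and the M_eff formula given for get_weights() can be sketched in plain Python. Several points here are assumptions and not taken from ConKit: S_{i,j} is taken to be 1 when the fraction of identical positions between sequences i and j is at least the identity threshold (including j = i) and 0 otherwise, and the helper names fraction_identical and sequence_weights are hypothetical.

```python
def fraction_identical(a, b):
    """Fraction of aligned positions at which two sequences agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def sequence_weights(alignment, identity=0.8):
    """Weight of sequence i is 1 / sum_j S_ij (assumed definition of S)."""
    weights = []
    for a in alignment:
        # Number of sequences, including a itself, similar enough to a
        neighbours = sum(
            1 for b in alignment if fraction_identical(a, b) >= identity
        )
        weights.append(1.0 / neighbours)
    return weights


alignment = ["AAAAAAAAAA", "AAAAAAAAAA", "AAAAAAAAAC", "CCCCCCCCCC"]
weights = sequence_weights(alignment, identity=0.8)
print([round(w, 3) for w in weights])  # [0.333, 0.333, 0.333, 1.0]
print(round(sum(weights), 3))          # 2.0
```

Summing the weights gives the number of effective sequences under these assumptions: the three near-identical sequences together count as one, and the distinct sequence counts as one.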
is_alignment

A boolean status for the alignment

Returns: A boolean status for the alignment
Return type: bool
meff

The number of effective sequences

nseq

The number of sequences

remark

The SequenceFile-specific remarks

sort(kword, reverse=False, inplace=False)[source]

Sort the SequenceFile

Parameters:
  • kword (str) – The dictionary key to sort sequences by
  • reverse (bool, optional) – Sort the sequences in reverse order [default: False]
  • inplace (bool, optional) – Replace the saved order of sequences [default: False]
Returns:

The reference to the SequenceFile, regardless of inplace

Return type:

SequenceFile

Raises:

ValueError – kword not in SequenceFile
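
Keyword-based sorting of this kind can be sketched with a stand-in record type. Record and sort_records are hypothetical names used only for illustration; the sketch returns a new list and does not model the inplace option.

```python
from collections import namedtuple

# Hypothetical stand-in for a sequence entry with "id" and "seq" attributes
Record = namedtuple("Record", ["id", "seq"])


def sort_records(records, kword, reverse=False):
    """Return records sorted by the attribute named kword."""
    if any(not hasattr(record, kword) for record in records):
        raise ValueError(f"{kword} not in records")
    return sorted(records, key=lambda r: getattr(r, kword), reverse=reverse)


records = [Record("foo", "ABCDEF"), Record("bar", "ZYXWVU")]
print([r.id for r in sort_records(records, "id")])                # ['bar', 'foo']
print([r.id for r in sort_records(records, "seq", reverse=True)])  # ['bar', 'foo']
```

Asking for a keyword that the records do not carry raises ValueError, matching the documented behaviour of sort().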

status

An indication of the sequence file status, i.e. alignment, no alignment, or unknown

summary()[source]

Generate a summary for the SequenceFile

Returns:
Return type: str
to_string()[source]

Return the SequenceFile as str

top_sequence

The first Sequence entry in SequenceFile

Returns: The first Sequence entry in SequenceFile
Return type: Sequence
trim(start, end, inplace=False)[source]

Trim the SequenceFile

Parameters:
  • start (int) – First residue to include
  • end (int) – Final residue to include
  • inplace (bool, optional) – Replace the saved order of sequences [default: False]
Returns:

The reference to the SequenceFile, regardless of inplace

Return type:

SequenceFile
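
A minimal sketch of trimming, assuming that start and end are 1-based residue numbers and that both are inclusive, as the parameter descriptions suggest. The helper name trim_sequences is hypothetical, and sequences are represented as plain strings.

```python
def trim_sequences(sequences, start, end):
    """Keep residues start..end (1-based, inclusive) of every sequence."""
    if start < 1 or end < start:
        raise ValueError("require 1 <= start <= end")
    # Residue number start sits at index start - 1; the slice end is exclusive
    return [seq[start - 1:end] for seq in sequences]


print(trim_sequences(["ABCDEFGH", "ZYXWVUTS"], 2, 5))  # ['BCDE', 'YXWV']
```

Every sequence is cut to the same window, so an alignment remains an alignment after trimming.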