cd-hit-est is part of a suite of programs designed to quickly group sequences.
cd-hit-est groups nucleotide sequences (without introns) into clusters that meet a user-defined similarity threshold.
Input is a FASTA file of nucleotide sequences. Output is a file of (non-redundant) representative sequences and a file listing the sequences in each cluster.
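As a rough illustration of a typical run, the sketch below builds a cd-hit-est command line from Python. The file names are hypothetical, and the 0.95 identity threshold and word size of 10 are assumed example values rather than recommendations. It relies on the commonly documented options -i (input FASTA), -o (output name), -c (identity threshold) and -n (word size); check the help text of your installed version before relying on them.

```python
import subprocess


def build_cd_hit_est_command(fasta_in, out_prefix, identity=0.95, word_size=10):
    """Return the argument list for a cd-hit-est run.

    All defaults here are illustrative assumptions, not values
    taken from the cd-hit documentation.
    """
    return [
        "cd-hit-est",
        "-i", str(fasta_in),    # input FASTA file of nucleotide sequences
        "-o", str(out_prefix),  # output name for the representative sequences
        "-c", str(identity),    # similarity threshold (fraction, e.g. 0.95)
        "-n", str(word_size),   # word size used for the comparison
    ]


def run_cd_hit_est(fasta_in, out_prefix, **kwargs):
    """Run cd-hit-est; requires the program to be on the PATH."""
    cmd = build_cd_hit_est_command(fasta_in, out_prefix, **kwargs)
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical file names, printed rather than executed.
    print(" ".join(build_cd_hit_est_command("transcripts.fasta", "transcripts_nr")))
```

With these assumed names, the representative sequences would be expected in transcripts_nr and the cluster listing in transcripts_nr.clstr.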
Other programs in this suite are:
- mcd-hit - a modified version of cd-hit, designed for sets of proteins of very different lengths. It uses a low clustering threshold.
- cd-hit-2d - compares two protein data sets. It provides a list of similar sequences in the two sets, and a list of sequences in the second set that are not similar to sequences in the first set.
- cd-hit - similar to cd-hit-est, but designed to group protein sequences.
- cd-hit-est-2d - similar to cd-hit-2d but designed to compare two nucleotide datasets.
The scripts below use the output files (.clstr files) as input and generate reports.
- plot_len.pl - generates a text file of distributions of clusters and sequences.
- clstr_sort_by.pl - sorts clusters by length or by number of sequences in the cluster.
- clstr_sort_prot_by.pl - sorts sequences within clusters by length or by name.
- clstr_merge.pl - merges two or more .clstr files.
- clstr_renumber.pl - re-numbers clusters and sequences within clusters in a .clstr file after merging (or other operations).
- clstr_rev.pl - combines a .clstr file with its parent .clstr file.
- make_multi_seq.pl - reads a .clstr file and makes a fasta file for each cluster over a certain size.
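The scripts above all read .clstr files. For readers who want to inspect one directly, here is a minimal Python sketch of a parser. It assumes the usual layout of these files: a ">Cluster N" header line, followed by one line per member giving an index, a length with an "nt" or "aa" suffix, the sequence name after ">", and either "*" for the representative or an "at ...%" identity. The sample data and function name are made up for illustration, and this is not a replacement for the Perl scripts.

```python
import re

# Assumed member line layout: "<index>\t<length>nt, ><name>... <status>"
MEMBER_RE = re.compile(r"^\d+\s+(\d+)(?:nt|aa),\s+>(.+?)\.\.\.\s+(.*)$")

# Made-up sample in the assumed .clstr layout.
SAMPLE = (
    ">Cluster 0\n"
    "0\t1500nt, >seqA... *\n"
    "1\t1480nt, >seqB... at +/98.50%\n"
    ">Cluster 1\n"
    "0\t900nt, >seqC... *\n"
)


def parse_clstr(lines):
    """Map each cluster name to a list of member records."""
    clusters = {}
    current = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">Cluster"):
            # Header line starts a new cluster, e.g. "Cluster 0".
            current = line[1:].strip()
            clusters[current] = []
            continue
        match = MEMBER_RE.match(line)
        if match and current is not None:
            length, name, status = match.groups()
            clusters[current].append({
                "name": name,
                "length": int(length),
                # "*" marks the representative sequence of the cluster.
                "representative": status.strip() == "*",
            })
    return clusters


if __name__ == "__main__":
    for cluster, members in parse_clstr(SAMPLE.splitlines()).items():
        names = ", ".join(m["name"] for m in members)
        print(f"{cluster}: {len(members)} sequence(s): {names}")
```

To read a real file, pass an open file handle instead of the sample, for example parse_clstr(open("transcripts_nr.clstr")), where the file name is again hypothetical.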