BioInfoRx - Bio-Linux Software Documentation Pages

cd-hit-est

Name	cd-hit-est
Description	cd-hit-est is part of a suite of programs designed to quickly group sequences. cd-hit-est groups nucleotide sequences (without introns) into clusters that meet a user-defined similarity threshold. Input is a FASTA file of nucleotide sequences. Output is a file of non-redundant representative sequences and a .clstr file listing the sequences in each cluster.

Other programs in this suite are:
cd-hit - similar to cd-hit-est, but designed to group protein sequences.
mcd-hit - a modified version of cd-hit, designed for sets of proteins of very different lengths. It uses a low clustering threshold.
cd-hit-2d - compares two protein data sets. It provides a list of similar sequences in the two sets, and a list of sequences in the second set that are not similar to sequences in the first set.
cd-hit-est-2d - similar to cd-hit-2d, but designed to compare two nucleotide data sets.

The scripts below take the output .clstr files as input and generate reports:
plot_len.pl - generates a text file of distributions of clusters and sequences.
clstr_sort_by.pl - sorts clusters by length and by number of sequences in the cluster.
clstr_sort_prot_by.pl - sorts sequences within clusters by length and name.
clstr_merge.pl - merges two or more .clstr files.
clstr_renumber.pl - renumbers clusters and sequences within clusters in a .clstr file after merging (or other operations).
clstr_rev.pl - combines a .clstr file with its parent .clstr file.
make_multi_seq.pl - reads a .clstr file and makes a FASTA file for each cluster over a certain size.
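
The .clstr output mentioned above can also be read directly in a script. The following is a minimal Python sketch, not part of the cd-hit suite. It assumes the commonly seen .clstr layout: a ">Cluster N" header line, followed by one line per member giving an index, a length with an "nt" or "aa" suffix, the sequence name after ">", and either "*" for the representative sequence or an "at ...%" identity. The function name parse_clstr and the sample data are hypothetical, for illustration only; check them against your own output files before relying on them.

```python
import re

# Assumed shape of a member line (illustrative, verify against real output):
#   "0\t1500nt, >seq1... *"
#   "1\t1400nt, >seq2... at +/98.50%"
MEMBER_RE = re.compile(r"^\d+\s+(\d+)(?:nt|aa),\s+>(.+?)\.\.\.\s+(.*)$")


def parse_clstr(lines):
    """Return {cluster_name: [(seq_id, length, is_representative), ...]}."""
    clusters = {}
    current = None
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        if line.startswith(">"):
            # Header line such as ">Cluster 0" starts a new cluster.
            current = line[1:].strip()
            clusters[current] = []
            continue
        match = MEMBER_RE.match(line)
        if match is None or current is None:
            raise ValueError(f"unrecognised .clstr line: {line!r}")
        length, seq_id, status = match.groups()
        # A trailing "*" marks the representative sequence of the cluster.
        clusters[current].append((seq_id, int(length), status.strip() == "*"))
    return clusters


# Made-up sample data in the assumed format (not real cd-hit-est output).
SAMPLE = """\
>Cluster 0
0\t1500nt, >seq1... *
1\t1400nt, >seq2... at +/98.50%
>Cluster 1
0\t900nt, >seq3... *
"""

if __name__ == "__main__":
    for name, members in parse_clstr(SAMPLE.splitlines()).items():
        rep = next(seq_id for seq_id, _, is_rep in members if is_rep)
        print(f"{name}: {len(members)} sequence(s), representative {rep}")
```

To read a real file, pass an open file handle instead of the sample, for example parse_clstr(open("output.clstr")), where output.clstr is a placeholder file name.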
Homepage	http://cd-hit.org
Remote Documentation	http://bioinformatics.oxfordjournals.org/cgi/reprint/22/13/1658 http://www.bioinformatics.org/cd-hit/cd-hit-user-guide.pdf

cd-hit user guide