Class SequenceUtil


  • public class SequenceUtil
    extends Object
    • Field Summary

      Fields 
      Modifier and Type Field Description
      static byte a
      Byte typed variables for all normal bases.
      static byte A
      Byte typed variables for all normal bases.
      static byte c
      Byte typed variables for all normal bases.
      static byte C
      Byte typed variables for all normal bases.
      static byte g
      Byte typed variables for all normal bases.
      static byte G
      Byte typed variables for all normal bases.
      static byte n
      Byte typed variables for all normal bases.
      static byte N
      Byte typed variables for all normal bases.
      static byte t
      Byte typed variables for all normal bases.
      static byte T
      Byte typed variables for all normal bases.
      static byte[] VALID_BASES_LOWER  
      static byte[] VALID_BASES_UPPER  
    • Constructor Summary

      Constructors 
      Constructor Description
      SequenceUtil()  
    • Field Detail

      • a

        public static final byte a
        Byte typed variables for all normal bases.
        See Also:
        Constant Field Values
      • c

        public static final byte c
        Byte typed variables for all normal bases.
        See Also:
        Constant Field Values
      • g

        public static final byte g
        Byte typed variables for all normal bases.
        See Also:
        Constant Field Values
      • t

        public static final byte t
        Byte typed variables for all normal bases.
        See Also:
        Constant Field Values
      • n

        public static final byte n
        Byte typed variables for all normal bases.
        See Also:
        Constant Field Values
      • A

        public static final byte A
        Byte typed variables for all normal bases.
        See Also:
        Constant Field Values
      • C

        public static final byte C
        Byte typed variables for all normal bases.
        See Also:
        Constant Field Values
      • G

        public static final byte G
        Byte typed variables for all normal bases.
        See Also:
        Constant Field Values
      • T

        public static final byte T
        Byte typed variables for all normal bases.
        See Also:
        Constant Field Values
      • N

        public static final byte N
        Byte typed variables for all normal bases.
        See Also:
        Constant Field Values
      • VALID_BASES_UPPER

        public static final byte[] VALID_BASES_UPPER
      • VALID_BASES_LOWER

        public static final byte[] VALID_BASES_LOWER
    • Constructor Detail

      • SequenceUtil

        public SequenceUtil()
    • Method Detail

      • reverseComplement

        public static String reverseComplement​(String sequenceData)
        Calculate the reverse complement of the specified sequence (Stolen from Reseq)
        Parameters:
        sequenceData -
        Returns:
        reverse complement
      • basesEqual

        public static boolean basesEqual​(byte lhs,
                                         byte rhs)
        Efficiently compare two IUPAC base codes, simply returning true if they are equal (ignoring case), without considering the set relationships between ambiguous codes.
      • readBaseMatchesRefBaseWithAmbiguity

        public static boolean readBaseMatchesRefBaseWithAmbiguity​(byte readBase,
                                                                  byte refBase)
        Efficiently compare two IUPAC base codes, one coming from a read sequence and the other coming from a reference sequence, using the reference code as a 'pattern' that the read base must match. We take ambiguous codes into account, returning true if the set of possible bases represented by the read value is a (non-strict) subset of the possible bases represented by the reference value. Since the comparison is directional, make sure to pass read / ref codes in correct order.
      • isNoCall

        public static boolean isNoCall​(byte base)
        returns true if the value of base represents a no call
      • isValidBase

        public static boolean isValidBase​(byte b)
        Returns true if the byte is in [acgtACGT].
      • isUpperACGTN

        public static boolean isUpperACGTN​(byte base)
        Check if the given base is one of upper case ACGTN
      • getIUPACCodesString

        public static String getIUPACCodesString()
        Returns all IUPAC codes as a string
      • isIUPAC

        public static boolean isIUPAC​(byte base)
        Checks if the given base is a IUPAC code
      • calculateGc

        public static double calculateGc​(byte[] bases)
        Calculates the fraction of bases that are G/C in the sequence
      • isBamReadBase

        public static boolean isBamReadBase​(byte base)
        Check if the given base belongs to BAM read base set '=ABCDGHKMNRSTVWY'
      • toBamReadBasesInPlace

        public static byte[] toBamReadBasesInPlace​(byte[] bases)
        Update and return the given array of bases by upper casing and then replacing all non-BAM read bases with N
      • assertSequenceListsEqual

        public static void assertSequenceListsEqual​(List<SAMSequenceRecord> s1,
                                                    List<SAMSequenceRecord> s2)
        default signature that forces the lists to be the same size
        Parameters:
        s1 - a list of sequence headers
        s2 - a second list of sequence headers
      • assertSequenceListsEqual

        public static void assertSequenceListsEqual​(List<SAMSequenceRecord> s1,
                                                    List<SAMSequenceRecord> s2,
                                                    boolean checkPrefixOnly)
        Throws an exception only if both (first) parameters are not null optionally check that one list is a (nonempty) prefix of the other.
        Parameters:
        s1 - a list of sequence headers
        s2 - a second list of sequence headers
        checkPrefixOnly - a flag specifying whether to only look at the first records in the lists. This will then check that the records of the smaller dictionary are equal to the records of the beginning of the larger dictionary, which can be useful since sometimes different pipelines choose to use only the first contigs of a standard reference.
      • areSequenceDictionariesEqual

        public static boolean areSequenceDictionariesEqual​(SAMSequenceDictionary s1,
                                                           SAMSequenceDictionary s2)
        Returns true if both parameters are null or equal, otherwise returns false
        Parameters:
        s1 - a list of sequence headers
        s2 - a second list of sequence headers
      • assertSequenceDictionariesEqual

        public static void assertSequenceDictionariesEqual​(SAMSequenceDictionary s1,
                                                           SAMSequenceDictionary s2)
        Throws an exception if both parameters are non-null and unequal.
        Parameters:
        s1 - a list of sequence headers
        s2 - a second list of sequence headers
      • assertSequenceDictionariesEqual

        public static void assertSequenceDictionariesEqual​(SAMSequenceDictionary s1,
                                                           SAMSequenceDictionary s2,
                                                           boolean checkPrefixOnly)
        Throws an exception if both (first) parameters are non-null and unequal (if checkPrefixOnly, checks prefix of lists only).
        Parameters:
        s1 - a list of sequence headers
        s2 - a second list of sequence headers
        checkPrefixOnly - a flag specifying whether to only look at the first records in the lists. This will then check that the records of the smaller dictionary are equal to the records of the beginning of the larger dictionary, which can be useful since sometimes different pipelines choose to use only the first contigs of a standard reference.
      • assertSequenceDictionariesEqual

        public static void assertSequenceDictionariesEqual​(SAMSequenceDictionary s1,
                                                           SAMSequenceDictionary s2,
                                                           File f1,
                                                           File f2)
        Throws an exception if both parameters are non-null and unequal, including the filenames.
      • makeCigarStringWithPossibleClipping

        public static String makeCigarStringWithPossibleClipping​(int alignmentStart,
                                                                 int readLength,
                                                                 int referenceSequenceLength)
        Create a simple ungapped cigar string, which might have soft clipping at either end
        Parameters:
        alignmentStart - raw aligment start, which may result in read hanging off beginning or end of read
        Returns:
        cigar string that may have S operator at beginning or end, and has M operator for the rest of the read
      • makeCigarStringWithIndelPossibleClipping

        public static String makeCigarStringWithIndelPossibleClipping​(int alignmentStart,
                                                                      int readLength,
                                                                      int referenceSequenceLength,
                                                                      int indelPosition,
                                                                      int indelLength)
        Create a cigar string for a gapped alignment, which may have soft clipping at either end
        Parameters:
        alignmentStart - raw alignment start, which may result in read hanging off beginning or end of read
        readLength -
        referenceSequenceLength -
        indelPosition - number of matching bases before indel. Must be > 0
        indelLength - length of indel. Positive for insertion, negative for deletion.
        Returns:
        cigar string that may have S operator at beginning or end, has one or two M operators, and an I or a D.
      • makeSoftClipCigar

        public static String makeSoftClipCigar​(int clipLength)
      • countMismatches

        public static int countMismatches​(SAMRecord read,
                                          byte[] referenceBases)
        Calculates the number of mismatches between the read and the reference sequence provided.
      • countMismatches

        public static int countMismatches​(SAMRecord read,
                                          byte[] referenceBases,
                                          int referenceOffset)
        Calculates the number of mismatches between the read and the reference sequence provided.
      • countMismatches

        public static int countMismatches​(SAMRecord read,
                                          byte[] referenceBases,
                                          int referenceOffset,
                                          boolean bisulfiteSequence)
        Calculates the number of mismatches between the read and the reference sequence provided.
        Parameters:
        referenceBases - Array of ASCII bytes that covers at least the the portion of the reference sequence to which read is aligned from getReferenceStart to getReferenceEnd.
        referenceOffset - 0-based offset of the first element of referenceBases relative to the start of that reference sequence.
        bisulfiteSequence - If this is true, it is assumed that the reads were bisulfite treated and C->T on the positive strand and G->A on the negative strand will not be counted as mismatches.
      • countMismatches

        public static int countMismatches​(SAMRecord read,
                                          byte[] referenceBases,
                                          int referenceOffset,
                                          boolean bisulfiteSequence,
                                          boolean matchAmbiguousRef)
      • countMismatches

        public static int countMismatches​(SAMRecord read,
                                          byte[] referenceBases,
                                          boolean bisulfiteSequence)
        Calculates the number of mismatches between the read and the reference sequence provided.
        Parameters:
        referenceBases - Array of ASCII bytes that covers at least the the portion of the reference sequence to which read is aligned from getReferenceStart to getReferenceEnd.
        bisulfiteSequence - If this is true, it is assumed that the reads were bisulfite treated and C->T on the positive strand and G->A on the negative strand will not be counted as mismatches.
      • sumQualitiesOfMismatches

        public static int sumQualitiesOfMismatches​(SAMRecord read,
                                                   byte[] referenceBases)
        Calculates the sum of qualities for mismatched bases in the read.
        Parameters:
        referenceBases - Array of ASCII bytes in which the 0th position in the array corresponds to the first element of the reference sequence to which read is aligned.
      • sumQualitiesOfMismatches

        public static int sumQualitiesOfMismatches​(SAMRecord read,
                                                   byte[] referenceBases,
                                                   int referenceOffset)
        Calculates the sum of qualities for mismatched bases in the read.
        Parameters:
        referenceBases - Array of ASCII bytes that covers at least the the portion of the reference sequence to which read is aligned from getReferenceStart to getReferenceEnd.
        referenceOffset - 0-based offset of the first element of referenceBases relative to the start of that reference sequence.
      • sumQualitiesOfMismatches

        public static int sumQualitiesOfMismatches​(SAMRecord read,
                                                   byte[] referenceBases,
                                                   int referenceOffset,
                                                   boolean bisulfiteSequence)
        Calculates the sum of qualities for mismatched bases in the read.
        Parameters:
        referenceBases - Array of ASCII bytes that covers at least the the portion of the reference sequence to which read is aligned from getReferenceStart to getReferenceEnd.
        referenceOffset - 0-based offset of the first element of referenceBases relative to the start of that reference sequence.
        bisulfiteSequence - If this is true, it is assumed that the reads were bisulfite treated and C->T on the positive strand and G->A on the negative strand will not be counted as mismatches.
      • countInsertedBases

        public static int countInsertedBases​(Cigar cigar)
      • countDeletedBases

        public static int countDeletedBases​(Cigar cigar)
      • countInsertedBases

        public static int countInsertedBases​(SAMRecord read)
      • countDeletedBases

        public static int countDeletedBases​(SAMRecord read)
      • calculateSamNmTag

        public static int calculateSamNmTag​(SAMRecord read,
                                            byte[] referenceBases)
        Calculates the predefined NM tag from the SAM spec: (# of mismatches + # of indels) For the purposes for calculating mismatches, we do not yet support IUPAC ambiguous codes (see readBaseMatchesRefBaseWithAmbiguity method).
      • calculateSamNmTag

        public static int calculateSamNmTag​(SAMRecord read,
                                            byte[] referenceBases,
                                            int referenceOffset)
        Calculates the predefined NM tag from the SAM spec: (# of mismatches + # of indels) For the purposes for calculating mismatches, we do not yet support IUPAC ambiguous codes (see readBaseMatchesRefBaseWithAmbiguity method).
        Parameters:
        referenceOffset - 0-based offset of the first element of referenceBases relative to the start of that reference sequence.
      • calculateSamNmTag

        public static int calculateSamNmTag​(SAMRecord read,
                                            byte[] referenceBases,
                                            int referenceOffset,
                                            boolean bisulfiteSequence)
        Calculates the predefined NM tag from the SAM spec: (# of mismatches + # of indels) For the purposes for calculating mismatches, we do not yet support IUPAC ambiguous codes (see readBaseMatchesRefBaseWithAmbiguity method).
        Parameters:
        referenceOffset - 0-based offset of the first element of referenceBases relative to the start of that reference sequence.
        bisulfiteSequence - If this is true, it is assumed that the reads were bisulfite treated and C->T on the positive strand and G->A on the negative strand will not be counted as mismatches.
      • calculateSamNmTagFromCigar

        public static int calculateSamNmTagFromCigar​(SAMRecord record)
        Attempts to calculate the predefined NM tag from the SAM spec using the cigar string alone. It may calculate incorrectly if ambiguous operators (Like M) are used. Needed for testing infrastructure: SAMRecordSetBuilder
      • complement

        public static byte complement​(byte b)
        Returns the complement of a single byte.
      • bisulfiteBasesEqual

        public static boolean bisulfiteBasesEqual​(boolean negativeStrand,
                                                  byte read,
                                                  byte reference)
        Returns true if the bases are equal OR if the mismatch can be accounted for by bisulfite treatment. C->T on the positive strand and G->A on the negative strand do not count as mismatches.
      • bisulfiteBasesEqual

        public static boolean bisulfiteBasesEqual​(byte read,
                                                  byte reference)
      • bisulfiteBasesMatchWithAmbiguity

        public static boolean bisulfiteBasesMatchWithAmbiguity​(boolean negativeStrand,
                                                               byte read,
                                                               byte reference)
        Same as above, but use readBaseMatchesRefBaseWithAmbiguity instead of basesEqual. Note that isBisulfiteConverted is not affected because it only applies when the reference base is non-ambiguous.
      • isBisulfiteConverted

        public static boolean isBisulfiteConverted​(byte read,
                                                   byte reference,
                                                   boolean negativeStrand)
        Checks for bisulfite conversion, C->T on the positive strand and G->A on the negative strand.
      • isBisulfiteConverted

        public static boolean isBisulfiteConverted​(byte read,
                                                   byte reference)
      • makeReferenceFromAlignment

        public static byte[] makeReferenceFromAlignment​(SAMRecord rec,
                                                        boolean includeReferenceBasesForDeletions)
        Produce reference bases from an aligned SAMRecord with MD string and Cigar.
        Parameters:
        rec - Must contain non-empty CIGAR and MD attribute.
        includeReferenceBasesForDeletions - If true, include reference bases that are deleted in the read. This will make the returned array not line up with the read if there are deletions.
        Returns:
        References bases corresponding to the read. If there is an insertion in the read, reference contains '-'. If the read is soft-clipped, reference contains '0'. If there is a skipped region and includeReferenceBasesForDeletions==true, reference will have Ns for the skipped region.
      • reverseComplement

        public static void reverseComplement​(byte[] bases)
        Reverses and complements the bases in place.
      • reverseQualities

        public static void reverseQualities​(byte[] quals)
        Reverses the quals in place.
      • reverse

        public static void reverse​(byte[] array,
                                   int offset,
                                   int len)
      • reverseComplement

        public static void reverseComplement​(byte[] bases,
                                             int offset,
                                             int len)
      • calculateMD5String

        public static String calculateMD5String​(byte[] data)
      • calculateMD5String

        public static String calculateMD5String​(byte[] data,
                                                int offset,
                                                int len)
      • md5DigestToString

        public static String md5DigestToString​(byte[] digest)
        Convets the result of an md5Digest to a string
        Parameters:
        digest - digest that needs to be converted to a string
        Returns:
        string representing the md5
      • calculateMD5

        public static byte[] calculateMD5​(byte[] data,
                                          int offset,
                                          int len)
      • calculateMdAndNmTags

        public static void calculateMdAndNmTags​(SAMRecord record,
                                                byte[] ref,
                                                boolean calcMD,
                                                boolean calcNM)
        Calculate MD and NM similarly to Samtools, except that N->N is a match.
        Parameters:
        record - Input record for which to calculate NM and MD. The appropriate tags will be added/updated in the record
        ref - The reference bases for the sequence to which the record is mapped
        calcMD - A flag indicating whether to update the MD tag in the record
        calcNM - A flag indicating whether to update the NM tag in the record
      • upperCase

        public static byte upperCase​(byte base)
      • upperCase

        public static byte[] upperCase​(byte[] bases)
      • generateAllKmers

        public static List<byte[]> generateAllKmers​(int length)
        Generates all possible unambiguous kmers (upper-case) of length and returns them as byte[]s.
      • getSamReadNameFromFastqHeader

        public static String getSamReadNameFromFastqHeader​(String fastqHeader)
        Returns a read name from a FASTQ header string suitable for use in a SAM/BAM file. Any letters after the first space are ignored. Ths method also strips trailing "/1" or "/2" so that paired end reads have the same name.
        Parameters:
        fastqHeader - the header from a FastqRecord.
        Returns:
        a read name appropriate for output in a SAM/BAM file.
      • getRandomBases

        public static byte[] getRandomBases​(Random random,
                                            int length)
        Returns an array of bytes containing random DNA bases
        Parameters:
        random - A Random object to use for drawing randomw bases
        length - How many bases to return.
        Returns:
        an array of random DNA bases of the requested length.
      • getRandomBases

        public static void getRandomBases​(Random random,
                                          int length,
                                          byte[] bases)
        Fills an array of bytes with random DNA bases. Will overwrite first length byte with new bases. if length is less than the size of the input array, the remaining bases will not be modified.
        Parameters:
        random - A Random object to use for drawing random bases
        length - How many bases to return.
        bases - Array to use for bases (from index 0)