Class SequenceManipulation

java.lang.Object
ubic.gemma.core.analysis.sequence.SequenceManipulation

public class SequenceManipulation extends Object
Convenience methods for manipulating BioSequences and PhysicalLocations.
Author:
pavlidis
  • Constructor Details

    • SequenceManipulation

      public SequenceManipulation()
  • Method Details

    • blatFormatChromosomeName

      public static String blatFormatChromosomeName(String chromosome)
      Adds the "chr" prefix to the chromosome name, if it is not already there.
      Parameters:
      chromosome - chromosome
      Returns:
      formatted name
    • stripPolyAorT

      public static String stripPolyAorT(String sequence, int thresholdLength)
      Remove a 3' polyA or 5' polyT tail. The entire tail is removed.
      Parameters:
      sequence - sequence
      thresholdLength - tail length required to trigger removal.
      Returns:
      processed sequence
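The tail-stripping behaviour described for stripPolyAorT can be illustrated with a minimal sketch. This is not the Gemma implementation: the class and method names are hypothetical, and it assumes upper-case sequences and that a run counts as a tail when it is at least thresholdLength bases long.

```java
// Hypothetical sketch; not the actual SequenceManipulation code.
class PolyTailSketch {

    // Removes a trailing run of 'A' (3' polyA) and a leading run of 'T' (5' polyT).
    // ASSUMPTION: a run qualifies when its length is >= thresholdLength.
    static String stripPolyTail(String sequence, int thresholdLength) {
        if (thresholdLength < 1) {
            throw new IllegalArgumentException("thresholdLength must be at least 1");
        }
        // The whole run is removed, not just the part beyond the threshold.
        String withoutPolyA = sequence.replaceAll("A{" + thresholdLength + ",}$", "");
        return withoutPolyA.replaceAll("^T{" + thresholdLength + ",}", "");
    }

    public static void main(String[] args) {
        // Eight trailing A's exceed the threshold of 5, so the tail is dropped.
        System.out.println(stripPolyTail("GGCCT" + "AAAAAAAA", 5)); // prints GGCCT
        // Three trailing A's are below the threshold, so nothing changes.
        System.out.println(stripPolyTail("GGCCT" + "AAA", 5));      // prints GGCCTAAA
    }
}
```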
    • blatLocationsToIntArray

      public static int[] blatLocationsToIntArray(String blatLocations)
      Convert a psl-formatted list (comma-delimited) to an int[].
      Parameters:
      blatLocations - locations
      Returns:
      locations
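PSL block lists are comma-delimited and usually end with a trailing comma (for example "100,250,400,"). The sketch below shows one way such a list could be turned into an int[]; the class and method names are hypothetical and this is not the Gemma implementation.

```java
import java.util.Arrays;

// Hypothetical sketch; not the actual SequenceManipulation code.
class PslListSketch {

    // Parses a comma-delimited list of integers, ignoring empty tokens
    // such as the one produced by a trailing comma.
    static int[] parsePslList(String pslList) {
        return Arrays.stream(pslList.split(","))
                .map(String::trim)
                .filter(token -> !token.isEmpty())
                .mapToInt(Integer::parseInt)
                .toArray();
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(parsePslList("100,250,400,"))); // prints [100, 250, 400]
    }
}
```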
    • collapse

      public static BioSequence collapse(Collection<Reporter> sequences)
      Convert a CompositeSequence's immobilizedCharacteristics into a single sequence, using a simple merge-join strategy.
      Parameters:
      sequences - sequences
      Returns:
      BioSequence. Not all fields are filled in and must be set by the caller.
    • deBlatFormatChromosomeName

      public static String deBlatFormatChromosomeName(String chromosome)
      Removes "chr" prefix from the chromosome name, if it is there.
      Parameters:
      chromosome - chromosome
      Returns:
      formatted name
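The two chromosome-name helpers, blatFormatChromosomeName and deBlatFormatChromosomeName, are inverses of each other. A minimal sketch of that round trip follows; the class and method names are hypothetical, and the handling of edge cases (null input, mixed case) in the real methods is not documented here, so none is attempted.

```java
// Hypothetical sketch; not the actual SequenceManipulation code.
class ChromosomeNameSketch {

    // Adds "chr" only when the name does not already start with it.
    static String addChrPrefix(String chromosome) {
        return chromosome.startsWith("chr") ? chromosome : "chr" + chromosome;
    }

    // Removes a leading "chr" when present; otherwise returns the name unchanged.
    static String removeChrPrefix(String chromosome) {
        return chromosome.startsWith("chr") ? chromosome.substring(3) : chromosome;
    }

    public static void main(String[] args) {
        System.out.println(addChrPrefix("X"));       // prints chrX
        System.out.println(addChrPrefix("chrX"));    // prints chrX
        System.out.println(removeChrPrefix("chr7")); // prints 7
    }
}
```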
    • findCenter

      public static int findCenter(String starts, String sizes)
      Find where the center of a query location is in a gene. This is defined as the location of the center base of the query sequence relative to the 3' end of the gene.
      Parameters:
      starts - starts
      sizes - sizes
      Returns:
      center
    • getGeneExonOverlaps

      public static int getGeneExonOverlaps(String chromosome, String starts, String sizes, String strand, Gene gene)
      Given a gene, find out how much of it overlaps with exons provided as starts and sizes. This could involve more than one exon.
      Parameters:
      chromosome - chromosome name, as "chrX" or "X".
      starts - of the locations we are testing.
      sizes - of the locations we are testing.
      strand - to consider. If null, strand is ignored.
      gene - Gene we are testing
      Returns:
      Number of bases which overlap with exons of the gene. A value of zero indicates that the location is entirely within an intron. If multiple GeneProducts are associated with this gene, the best (highest) overlap is reported.
    • getGeneProductExonOverlap

      public static int getGeneProductExonOverlap(String starts, String sizes, String strand, GeneProduct geneProduct)
      Compute the overlap of a physical location with a transcript (gene product). This assumes that the chromosome is already matched.
      Parameters:
      starts - of the locations we are testing (in the target, so on the same coordinates as the geneProduct location is scored)
      sizes - of the locations we are testing.
      strand - the strand to look on. If null, strand is ignored.
      geneProduct - GeneProduct we are testing. If strand of PhysicalLocation is null, we ignore strand.
      Returns:
      Total number of bases which overlap with exons of the transcript. A value of zero indicates that the location is entirely within an intron, or the strand is wrong.
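The exon-overlap computation can be pictured as summing, over every aligned block, the bases it shares with each exon. The sketch below uses plain arrays instead of Gemma's GeneProduct and PhysicalLocation types; all names are hypothetical, and it assumes half-open [start, end) coordinates, non-overlapping exons, and that chromosome and strand have already been matched.

```java
// Hypothetical sketch; not the actual SequenceManipulation code.
class ExonOverlapSketch {

    // Number of positions shared by [startA, endA) and [startB, endB).
    // ASSUMPTION: half-open intervals; disjoint intervals yield 0.
    static long overlap(long startA, long endA, long startB, long endB) {
        return Math.max(0L, Math.min(endA, endB) - Math.max(startA, startB));
    }

    // Sums the overlap of each aligned block with each exon.
    // exons[i] holds {start, end} of one exon; exons must not overlap each other.
    static long totalExonOverlap(long[] blockStarts, long[] blockSizes, long[][] exons) {
        long total = 0;
        for (int i = 0; i < blockStarts.length; i++) {
            long blockEnd = blockStarts[i] + blockSizes[i];
            for (long[] exon : exons) {
                total += overlap(blockStarts[i], blockEnd, exon[0], exon[1]);
            }
        }
        return total;
    }

    public static void main(String[] args) {
        long[][] exons = {{120, 200}, {340, 400}};
        // Blocks [100,150) and [300,350) share 30 and 10 bases with the exons.
        System.out.println(totalExonOverlap(new long[] {100, 300}, new long[] {50, 50}, exons)); // prints 40
        // A block lying entirely in the intron [200,340) shares nothing.
        System.out.println(totalExonOverlap(new long[] {210}, new long[] {50}, exons)); // prints 0
    }
}
```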
    • computeOverlap

      public static int computeOverlap(long starta, long enda, long startb, long endb)
    • rightHandOverlap

      public static int rightHandOverlap(BioSequence target, BioSequence query)
      Compute only the overlap that the query sequence has with the right-hand side of the target.
      Parameters:
      target - target
      query - query
      Returns:
      right overlap
    • reverseComplement

      public static String reverseComplement(String sequence)
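No description is given for reverseComplement, but the operation itself is standard: read the sequence backwards and swap each base for its complement. A minimal sketch follows, with a hypothetical class name; how the real method treats ambiguity codes, case, or RNA is not documented here, so this version simply passes unknown symbols through unchanged.

```java
// Hypothetical sketch; not the actual SequenceManipulation code.
class ReverseComplementSketch {

    // Complements a single DNA base, preserving case.
    // ASSUMPTION: anything other than A, C, G, T (e.g. N) is left as-is.
    static char complement(char base) {
        switch (base) {
            case 'A': return 'T';
            case 'T': return 'A';
            case 'C': return 'G';
            case 'G': return 'C';
            case 'a': return 't';
            case 't': return 'a';
            case 'c': return 'g';
            case 'g': return 'c';
            default:  return base;
        }
    }

    // Walks the sequence from its last base to its first, complementing each base.
    static String reverseComplement(String sequence) {
        StringBuilder result = new StringBuilder(sequence.length());
        for (int i = sequence.length() - 1; i >= 0; i--) {
            result.append(complement(sequence.charAt(i)));
        }
        return result.toString();
    }

    public static void main(String[] args) {
        System.out.println(reverseComplement("ATGC")); // prints GCAT
    }
}
```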
    • totalSize

      public static int totalSize(String sizes)
      Parameters:
      sizes - Blat-formatted string of sizes (comma-delimited)
      Returns:
      total size
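Summing a Blat-formatted size list amounts to splitting on commas and adding the values. The sketch below is an illustration with hypothetical names, not the Gemma implementation; it tolerates the trailing comma that PSL lists usually carry.

```java
// Hypothetical sketch; not the actual SequenceManipulation code.
class TotalSizeSketch {

    // Adds up every integer in a comma-delimited list, skipping empty tokens.
    static int sumSizes(String sizes) {
        int total = 0;
        for (String token : sizes.split(",")) {
            String trimmed = token.trim();
            if (!trimmed.isEmpty()) {
                total += Integer.parseInt(trimmed);
            }
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(sumSizes("50,30,20,")); // prints 100
    }
}
```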
    • computeOverlap

      public static int computeOverlap(PhysicalLocation a, PhysicalLocation b)
      Compute the overlap between two physical locations. If neither location has a length, the overlap is zero unless they point to exactly the same nucleotide location, in which case the overlap is 1.
      Parameters:
      a - a
      b - b
      Returns:
      overlap
    • binFromRange

      public static int binFromRange(int start, int end)
      Parameters:
      start - start
      end - end
      Returns:
      bin that this start-end segment is in
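The documentation does not say which binning scheme binFromRange uses. The name matches the function in the UCSC Genome Browser source, so the sketch below assumes the standard UCSC scheme (five levels, smallest bins of 128 kb, each level 8 times larger, 0-based half-open coordinates). Treat this as an assumption, not a description of the Gemma code; the class name is hypothetical.

```java
// Hypothetical sketch of the standard UCSC binning scheme (an assumption here).
class UcscBinSketch {

    // Offset of the first bin at each level, from the finest (128 kb) to the coarsest (512 Mb).
    private static final int[] BIN_OFFSETS = {512 + 64 + 8 + 1, 64 + 8 + 1, 8 + 1, 1, 0};
    private static final int BIN_FIRST_SHIFT = 17; // 2^17 = 128 kb bins at the finest level
    private static final int BIN_NEXT_SHIFT = 3;   // each coarser level is 8x larger

    // Returns the smallest bin that fully contains the half-open range [start, end).
    static int binFromRange(int start, int end) {
        int startBin = start >> BIN_FIRST_SHIFT;
        int endBin = (end - 1) >> BIN_FIRST_SHIFT;
        for (int offset : BIN_OFFSETS) {
            if (startBin == endBin) {
                return offset + startBin;
            }
            startBin >>= BIN_NEXT_SHIFT;
            endBin >>= BIN_NEXT_SHIFT;
        }
        throw new IllegalArgumentException(
                "start " + start + ", end " + end + " out of range (maximum is 512M)");
    }

    public static void main(String[] args) {
        System.out.println(binFromRange(0, 1000));   // fits in the first 128 kb bin: prints 585
        System.out.println(binFromRange(0, 200000)); // spans two 128 kb bins, so moves up a level: prints 73
    }
}
```

Ranges that fit inside a single 128 kb window land in the finest level (bins 585 and up); wider ranges fall back to progressively coarser levels, which is what makes bin-based range queries cheap to index.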