Class ArrayDesignSequenceProcessingServiceImpl

java.lang.Object
ubic.gemma.core.loader.expression.arrayDesign.ArrayDesignSequenceProcessingServiceImpl
All Implemented Interfaces:
ArrayDesignSequenceProcessingService

@Component public class ArrayDesignSequenceProcessingServiceImpl extends Object implements ArrayDesignSequenceProcessingService
Handles collapsing the sequences and attaching sequences to DesignElements, either from provided input or via a fetch.
Author:
pavlidis
  • Field Details

    • DUPLICATE_PROBE_NAME_MUNGE_SEPARATOR

      public static final String DUPLICATE_PROBE_NAME_MUNGE_SEPARATOR
      When we encounter two probes with the same name, we add this string along with a unique identifier to the end of the name. This comes into play when the probe name is the sequence name, and the same sequence is used multiple times on the array design.
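      As a rough illustration of the renaming described above, the sketch below appends the separator plus a counter to repeated probe names. It is not the Gemma implementation: the "___" value is only inferred from the M100000439___Dup1 example given elsewhere on this page, and the class name, the mungeDuplicates helper, and the "Dup" + n identifier scheme are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ProbeNameMungeSketch {
    // Assumed value, inferred from the "M100000439___Dup1" example; check the real constant.
    static final String SEPARATOR = "___";

    /** Returns the names in input order, adding SEPARATOR + "Dup" + n to repeated names. */
    static List<String> mungeDuplicates(List<String> probeNames) {
        Map<String, Integer> seen = new HashMap<>();
        List<String> result = new ArrayList<>();
        for (String name : probeNames) {
            // merge() returns the updated occurrence count for this name.
            int count = seen.merge(name, 1, Integer::sum);
            // The first occurrence keeps its name; later ones get a numbered suffix.
            result.add(count == 1 ? name : name + SEPARATOR + "Dup" + (count - 1));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(mungeDuplicates(Arrays.asList("M100000439", "M100000439", "X1")));
    }
}
```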
  • Constructor Details

  • Method Details

    • assignSequencesToDesignElements

      public void assignSequencesToDesignElements(Collection<CompositeSequence> designElements, Collection<BioSequence> sequences)
      Description copied from interface: ArrayDesignSequenceProcessingService
      Associate sequences with an array design.
      Specified by:
      assignSequencesToDesignElements in interface ArrayDesignSequenceProcessingService
      Parameters:
      designElements - design elements
      sequences - the sequences to associate; for Affymetrix, these should be the collapsed probe sequences.
    • assignSequencesToDesignElements

      public void assignSequencesToDesignElements(Collection<CompositeSequence> designElements, File fastaFile) throws IOException
      Description copied from interface: ArrayDesignSequenceProcessingService
      Associate sequences with an array design. It is assumed that the name of the sequences can be matched to the name of a design element.
      Specified by:
      assignSequencesToDesignElements in interface ArrayDesignSequenceProcessingService
      Parameters:
      designElements - design elements
      fastaFile - fasta file
      Throws:
      IOException - when IO problems occur.
    • assignSequencesToDesignElements

      public void assignSequencesToDesignElements(Collection<CompositeSequence> designElements, InputStream fastaFile) throws IOException
      Associate sequences with an array design. It is assumed that the name of the sequences can be matched to the name of a design element. Provided for testing purposes.
      Specified by:
      assignSequencesToDesignElements in interface ArrayDesignSequenceProcessingService
      Throws:
      IOException
    • processAffymetrixDesign

      public Collection<BioSequence> processAffymetrixDesign(ArrayDesign arrayDesign, InputStream probeSequenceFile, Taxon taxon) throws IOException
      Description copied from interface: ArrayDesignSequenceProcessingService
      Use this to add sequences to an existing Affymetrix design. The sequences will be overwritten even if they already exist (that is, if the actual ATGCs need to be replaced, but the BioSequences are already filled in). Note that probe sets are often shared by platforms; rather than creating duplicates for each, we keep a single copy. This is considered safe because Affymetrix uses unique probeset names for a given set of actual probe sequences.
      Specified by:
      processAffymetrixDesign in interface ArrayDesignSequenceProcessingService
      Parameters:
      arrayDesign - An existing ArrayDesign that already has compositeSequences filled in.
      probeSequenceFile - InputStream from a tab-delimited probe sequence file.
      taxon - validated taxon
      Returns:
      bio sequences
      Throws:
      IOException - when IO problems occur.
    • processArrayDesign

      public Collection<BioSequence> processArrayDesign(ArrayDesign arrayDesign, InputStream sequenceFile, SequenceType sequenceType) throws IOException
      Description copied from interface: ArrayDesignSequenceProcessingService
      The sequence file must provide an unambiguous way to associate the sequences with design elements on the array. If the SequenceType is AFFY_PROBE, the sequences will be treated as probes in probe sets, in Affymetrix 'tabbed' format. Otherwise the format of the file is assumed to be FASTA, with one CompositeSequence per FASTA element; there is further assumed to be just one Reporter per CompositeSequence (that is, they are the same thing). The FASTA file must use a standard defline format. For FASTA files, the match-up of the sequence with the design element is done using the following tests, until one passes:
      1. The format line contains an explicit reference to the name of the CompositeSequence (probe id).
      2. The BioSequences for the CompositeSequences are already filled in, and there is a matching external database identifier (e.g., Genbank accession). This will only work if Genbank accessions do not re-occur in the FASTA file.
      Specified by:
      processArrayDesign in interface ArrayDesignSequenceProcessingService
      Parameters:
      arrayDesign - platform
      sequenceFile - FASTA format
      sequenceType - e.g., SequenceType.DNA (generic), SequenceType.AFFY_PROBE, or SequenceType.OLIGO.
      Returns:
      bio sequences
      Throws:
      IOException - when IO problems occur.
    • processArrayDesign

      public Collection<BioSequence> processArrayDesign(ArrayDesign arrayDesign, InputStream sequenceFile, SequenceType sequenceType, Taxon taxon) throws IOException
      Description copied from interface: ArrayDesignSequenceProcessingService
      The sequence file must provide an unambiguous way to associate the sequences with design elements on the array. If a probe does not have a match to a sequence in the input file, the sequence for that probe will be nulled. If the SequenceType is AFFY_PROBE, the sequences will be treated as probes in probe sets, in Affymetrix 'tabbed' format. If the SequenceType is OLIGO, the input is treated as a table (see ProbeSequenceParser; to retain semi-backwards compatibility, FASTA is detected but an exception will be thrown). Otherwise the format of the file is assumed to be FASTA, with one CompositeSequence per FASTA element; there is further assumed to be just one Reporter per CompositeSequence (that is, they are the same thing). The FASTA file must use a standard defline format. For FASTA files, the match-up of the sequence with the design element is done using the following tests, until one passes:
      1. The format line contains an explicit reference to the name of the CompositeSequence (probe id).
      2. The format line sequence name matches the CompositeSequence name with a suffix added to disambiguate duplicates. That is, sometimes the same sequence appears on the array more than once, and this is the identifier used for the probe; we add something like "___[string]" to the end of probe name in this case. For example, a sequence with name M100000439 will match probes named M100000439 as well as M100000439___Dup1.
      3. The BioSequences for the CompositeSequences are already filled in, and there is a matching external database identifier (e.g., Genbank accession). This will only work if Genbank accessions do not re-occur in the FASTA file.
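      The three tests above can be sketched per probe as follows. This is a minimal illustration under stated assumptions, not the Gemma implementation: the Probe class, the baseName, matches and matchingProbeNames helpers, the sample accession, and the "___" separator (taken from the M100000439___Dup1 example) are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class FastaMatchSketch {
    // Assumed separator, inferred from the "M100000439___Dup1" example.
    static final String SEPARATOR = "___";

    /** Hypothetical stand-in for a CompositeSequence and the accession of its BioSequence. */
    static final class Probe {
        final String name;
        final String accession; // null when no BioSequence is filled in

        Probe(String name, String accession) {
            this.name = name;
            this.accession = accession;
        }
    }

    /** Strips a trailing disambiguation suffix, if there is one. */
    static String baseName(String probeName) {
        int i = probeName.lastIndexOf(SEPARATOR);
        return i < 0 ? probeName : probeName.substring(0, i);
    }

    /** Applies the three tests in order and stops at the first one that passes. */
    static boolean matches(Probe probe, String sequenceName, String sequenceAccession) {
        if (probe.name.equals(sequenceName)) return true;           // test 1: explicit probe id
        if (baseName(probe.name).equals(sequenceName)) return true; // test 2: name minus suffix
        // test 3: an already-attached BioSequence carries the same accession
        return probe.accession != null && probe.accession.equals(sequenceAccession);
    }

    /** Collects the names of all probes that a given FASTA record would be attached to. */
    static List<String> matchingProbeNames(List<Probe> probes, String sequenceName, String sequenceAccession) {
        List<String> hits = new ArrayList<>();
        for (Probe probe : probes) {
            if (matches(probe, sequenceName, sequenceAccession)) hits.add(probe.name);
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Probe> probes = Arrays.asList(
                new Probe("M100000439", null),
                new Probe("M100000439___Dup1", null),
                new Probe("probe_7", "AA123456")); // made-up accession for illustration
        System.out.println(matchingProbeNames(probes, "M100000439", null));
        System.out.println(matchingProbeNames(probes, "unrelated", "AA123456"));
    }
}
```

      Evaluating the tests per probe, rather than once per file, lets a single sequence attach to both the original probe and its renamed duplicates, as in the M100000439 example.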
      Specified by:
      processArrayDesign in interface ArrayDesignSequenceProcessingService
      Parameters:
      arrayDesign - platform
      sequenceFile - FASTA, Affymetrix or tabbed format (depending on the type)
      sequenceType - e.g., SequenceType.DNA (generic), SequenceType.AFFY_PROBE, or SequenceType.OLIGO.
      taxon - if null, attempt to determine it from the array design.
      Returns:
      bio sequences
      Throws:
      IOException - when IO problems occur.
    • processArrayDesign

      public Collection<BioSequence> processArrayDesign(ArrayDesign arrayDesign, InputStream sequenceFile, InputStream sequenceIdentifierFile, SequenceType sequenceType, Taxon taxon) throws IOException
      Description copied from interface: ArrayDesignSequenceProcessingService
      Read from a FASTA file when the sequence file lacks any way to link the sequences back to the probes. The sequenceIdentifierFile supplies that link.
      Specified by:
      processArrayDesign in interface ArrayDesignSequenceProcessingService
      Parameters:
      arrayDesign - platform
      sequenceFile - FASTA
      sequenceIdentifierFile - two columns of probe ids and sequence IDs (the same ones in the sequenceFile)
      taxon - if null, attempt to determine it from the array design
      Returns:
      bio sequences
      Throws:
      IOException
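      A two-column identifier file of the kind described for sequenceIdentifierFile could be read along these lines. This is only a sketch: the tab delimiter is an assumption here (it is documented for the Genbank-accession overload, not for this one), and the class name, the readProbeToSequenceId helper, the sample ids, and the choices to skip blank lines and reject short rows are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Scanner;

class IdentifierFileSketch {
    /** Reads "probeId<TAB>sequenceId" rows into a map that preserves file order. */
    static Map<String, String> readProbeToSequenceId(InputStream in) {
        Map<String, String> result = new LinkedHashMap<>();
        try (Scanner scanner = new Scanner(in, "UTF-8")) {
            while (scanner.hasNextLine()) {
                String line = scanner.nextLine();
                if (line.trim().isEmpty()) continue; // assumption: blank lines are ignored
                String[] fields = line.split("\t");
                if (fields.length < 2) {
                    throw new IllegalArgumentException("Expected two tab-delimited columns: " + line);
                }
                result.put(fields[0].trim(), fields[1].trim());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up probe and sequence ids for illustration.
        String content = "probe_1\tseqA\nprobe_2\tseqB\n";
        InputStream in = new ByteArrayInputStream(content.getBytes(StandardCharsets.UTF_8));
        System.out.println(readProbeToSequenceId(in));
    }
}
```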
    • processArrayDesign

      public Collection<BioSequence> processArrayDesign(ArrayDesign arrayDesign, InputStream sequenceIdentifierFile, String[] databaseNames, Taxon taxon, boolean force) throws IOException
      Description copied from interface: ArrayDesignSequenceProcessingService
      Intended for use with array designs that use sequences that are in Genbank, but the accessions need to be assigned after the array is already in the system. This happens when only partial or incorrect information is in GEO, for example, when Refseq ids are provided instead of the EST clone that was arrayed. This method ALWAYS clobbers the BioSequence associations that are associated with the array design (at least, if any of the probe identifiers in the file given match the array design).
      Specified by:
      processArrayDesign in interface ArrayDesignSequenceProcessingService
      Parameters:
      arrayDesign - platform
      sequenceIdentifierFile - Sequence file has two columns: column 1 is a probe id, column 2 is a Genbank accession or sequence name, delimited by tab. Sequences will be fetched from BLAST databases if possible; ones missing will be sought directly in Gemma.
      databaseNames - database names
      taxon - taxon
      force - If true and an existing BioSequence that matches is found in the system, any existing sequence information in the BioSequence will be overwritten.
      Returns:
      bio sequences
      Throws:
      IOException - when IO problems occur.
    • processArrayDesign

      public Collection<BioSequence> processArrayDesign(ArrayDesign arrayDesign, InputStream sequenceIdentifierFile, String[] databaseNames, Taxon taxon, boolean force, FastaCmd fc) throws IOException
      Specified by:
      processArrayDesign in interface ArrayDesignSequenceProcessingService
      Throws:
      IOException
    • processArrayDesign

      public Collection<BioSequence> processArrayDesign(ArrayDesign arrayDesign, String[] databaseNames, boolean force)
      Description copied from interface: ArrayDesignSequenceProcessingService
      For the case where the sequences are retrieved simply by the Genbank accession. For this to work, the array design must already have the BioSequence objects, but they haven't been populated with the actual sequences (if they have, the values will be replaced if force=true). Sequences that appear to be IMAGE clones are given another check, and the Genbank accession used to retrieve the sequence is based on that, not the one provided in the BioSequence; if it differs it will be replaced. This happens when the Genbank accession is for a Refseq (for example) but the actual clone on the array is from IMAGE.
      Specified by:
      processArrayDesign in interface ArrayDesignSequenceProcessingService
      Parameters:
      arrayDesign - platform
      databaseNames - the names of the BLAST-formatted databases to search (e.g., nt, est_mouse)
      force - If true, then when an existing BioSequence contains a non-empty sequence value, it will be overwritten with a new one.
      Returns:
      bio sequences
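      The effect of the force flag described above can be pictured with a small sketch. It is an illustration only, not the Gemma implementation: the class name, the resolveSequence helper, and the choice to keep the existing value when nothing was fetched are assumptions.

```java
class ForceOverwriteSketch {
    /**
     * Decides which sequence string to store for a BioSequence.
     * existing is the value already stored (may be null or empty);
     * fetched is the value retrieved from the BLAST database (may be null).
     */
    static String resolveSequence(String existing, String fetched, boolean force) {
        if (fetched == null) return existing; // assumption: nothing fetched, nothing changes
        boolean hasExisting = existing != null && !existing.isEmpty();
        // A non-empty existing value is only replaced when force is true.
        return (hasExisting && !force) ? existing : fetched;
    }

    public static void main(String[] args) {
        System.out.println(resolveSequence("ACGT", "TTTT", false));
        System.out.println(resolveSequence("ACGT", "TTTT", true));
        System.out.println(resolveSequence("", "TTTT", false));
    }
}
```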
    • processArrayDesign

      public Collection<BioSequence> processArrayDesign(ArrayDesign arrayDesign, String[] databaseNames, boolean force, FastaCmd fc)
      Description copied from interface: ArrayDesignSequenceProcessingService
      Provided primarily for testing.
      Specified by:
      processArrayDesign in interface ArrayDesignSequenceProcessingService
      Parameters:
      arrayDesign - platform
      databaseNames - the names of the BLAST-formatted databases to search (e.g., nt, est_mouse)
      force - If true, then when an existing BioSequence contains a non-empty sequence value, it will be overwritten with a new one.
      fc - fasta command
      Returns:
      bio sequences
    • processSingleAccession

      public BioSequence processSingleAccession(String sequenceId, String[] databaseNames, boolean force)
      Update a single sequence in the system.
      Specified by:
      processSingleAccession in interface ArrayDesignSequenceProcessingService
      Parameters:
      sequenceId - sequence id
      databaseNames - database names
      force - If true and an existing BioSequence that matches is found in the system, any existing sequence information in the BioSequence will be overwritten.
      Returns:
      persistent BioSequence.
    • validateTaxon

      public Taxon validateTaxon(Taxon taxon, ArrayDesign arrayDesign) throws IllegalArgumentException
      If taxon is null, it was not provided on the command line, so the taxon is deduced from the arrayDesign. If the array design has zero or more than one taxon, an error is thrown, because this program can only be run for one taxon at a time when processing from a file.
      Specified by:
      validateTaxon in interface ArrayDesignSequenceProcessingService
      Parameters:
      taxon - Taxon as passed in on the command line
      arrayDesign - Array design to process
      Returns:
      Taxon to process
      Throws:
      IllegalArgumentException - Thrown when there is not exactly 1 taxon.
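      The fallback described for validateTaxon can be sketched as follows. This is a simplified illustration, not the Gemma implementation: taxa are represented as plain strings, the class name and the arrayDesignTaxa parameter are hypothetical, and passing a non-null taxon through unchanged is an assumption.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;

class ValidateTaxonSketch {
    /**
     * Returns the supplied taxon, or the single taxon of the array design when none was supplied.
     * arrayDesignTaxa stands in for whatever taxa the real ArrayDesign reports.
     */
    static String validateTaxon(String taxon, Collection<String> arrayDesignTaxa) {
        if (taxon != null) return taxon; // assumption: an explicit taxon is used as given
        if (arrayDesignTaxa.size() != 1) {
            throw new IllegalArgumentException(
                    "Expected exactly 1 taxon on the array design, found " + arrayDesignTaxa.size());
        }
        return arrayDesignTaxa.iterator().next();
    }

    public static void main(String[] args) {
        System.out.println(validateTaxon(null, Collections.singletonList("mouse")));
        try {
            validateTaxon(null, Arrays.asList("mouse", "rat"));
        } catch (IllegalArgumentException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```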