Class ArrayDesignSequenceProcessingServiceImpl

    • Field Detail

      • DUPLICATE_PROBE_NAME_MUNGE_SEPARATOR

        public static final String DUPLICATE_PROBE_NAME_MUNGE_SEPARATOR
        When we encounter two probes with the same name, we add this string along with a unique identifier to the end of the name. This comes into play when the probe name is the sequence name, and the same sequence is used multiple times on the array design.
        See Also:
        Constant Field Values
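        A minimal sketch of the munging idea described above. The separator value "___" and the "Dup" label are assumptions inferred from the M100000439___Dup1 example given for processArrayDesign; the class and method names are hypothetical and are not part of this API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class DuplicateProbeNameSketch {
    // Assumed value of the separator; check the constant's documented value before relying on it.
    static final String SEPARATOR = "___";

    /** Returns the names in input order; repeats get SEPARATOR + "Dup" + n appended. */
    static List<String> mungeDuplicates(List<String> names) {
        Map<String, Integer> seen = new HashMap<>();
        List<String> result = new ArrayList<>();
        for (String name : names) {
            // merge returns the updated count: 1 on first sight, 2 on the first repeat, ...
            int count = seen.merge(name, 1, Integer::sum);
            result.add(count == 1 ? name : name + SEPARATOR + "Dup" + (count - 1));
        }
        return result;
    }

    /** Strips the munge suffix, if any, to recover the original sequence name. */
    static String baseName(String probeName) {
        int idx = probeName.indexOf(SEPARATOR);
        return idx < 0 ? probeName : probeName.substring(0, idx);
    }
}
```

        Under these assumptions, three probes all named M100000439 would become M100000439, M100000439___Dup1 and M100000439___Dup2, and baseName maps each of them back to M100000439.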
    • Method Detail

      • processAffymetrixDesign

        public Collection<BioSequence> processAffymetrixDesign​(ArrayDesign arrayDesign,
                                                               InputStream probeSequenceFile,
                                                               Taxon taxon)
                                                        throws IOException
        Description copied from interface: ArrayDesignSequenceProcessingService
        Use this to add sequences to an existing Affymetrix design. The sequences will be overwritten even if they already exist (that is, this covers the case where the actual ATGCs need to be replaced but the BioSequences are already filled in). Note that probe sets are often shared by platforms; rather than creating duplicates for each, we keep a single copy. This is considered safe because Affymetrix uses unique probe set names for a given set of actual probe sequences.
        Specified by:
        processAffymetrixDesign in interface ArrayDesignSequenceProcessingService
        Parameters:
        arrayDesign - An existing ArrayDesign that already has compositeSequences filled in.
        probeSequenceFile - InputStream from a tab-delimited probe sequence file.
        taxon - validated taxon
        Returns:
        bio sequences
        Throws:
        IOException - when IO problems occur.
      • processArrayDesign

        public Collection<BioSequence> processArrayDesign​(ArrayDesign arrayDesign,
                                                          InputStream sequenceFile,
                                                          SequenceType sequenceType)
                                                   throws IOException
        Description copied from interface: ArrayDesignSequenceProcessingService
        The sequence file must provide an unambiguous way to associate the sequences with design elements on the array. If the SequenceType is AFFY_PROBE, the sequences will be treated as probes in probe sets, in Affymetrix 'tabbed' format. Otherwise the format of the file is assumed to be FASTA, with one CompositeSequence per FASTA element; there is further assumed to be just one Reporter per CompositeSequence (that is, they are the same thing). The FASTA file must use a standard defline format (as described here). For FASTA files, the match-up of the sequence with the design element is done using the following tests, until one passes:
        1. The format line contains an explicit reference to the name of the CompositeSequence (probe id).
        2. The BioSequences for the CompositeSequences are already filled in, and there is a matching external database identifier (e.g., Genbank accession). This will only work if Genbank accessions do not re-occur in the FASTA file.
        Specified by:
        processArrayDesign in interface ArrayDesignSequenceProcessingService
        Parameters:
        arrayDesign - platform
        sequenceFile - FASTA format
        sequenceType - e.g., SequenceType.DNA (generic), SequenceType.AFFY_PROBE, or SequenceType.OLIGO.
        Returns:
        bio sequences
        Throws:
        IOException - when IO problems occur.
        See Also:
        FastaParser
      • processArrayDesign

        public Collection<BioSequence> processArrayDesign​(ArrayDesign arrayDesign,
                                                          InputStream sequenceFile,
                                                          SequenceType sequenceType,
                                                          Taxon taxon)
                                                   throws IOException
        Description copied from interface: ArrayDesignSequenceProcessingService
        The sequence file must provide an unambiguous way to associate the sequences with design elements on the array. If a probe does not have a match to a sequence in the input file, the sequence for that probe will be nulled. If the SequenceType is AFFY_PROBE, the sequences will be treated as probes in probe sets, in Affymetrix 'tabbed' format. If the SequenceType is OLIGO, the input is treated as a table (see ProbeSequenceParser; to retain semi-backwards compatibility, FASTA is detected but an exception will be thrown). Otherwise the format of the file is assumed to be FASTA, with one CompositeSequence per FASTA element; there is further assumed to be just one Reporter per CompositeSequence (that is, they are the same thing). The FASTA file must use a standard defline format (as described here). For FASTA files, the match-up of the sequence with the design element is done using the following tests, until one passes:
        1. The format line contains an explicit reference to the name of the CompositeSequence (probe id).
        2. The format line sequence name matches the CompositeSequence name with a suffix added to disambiguate duplicates. That is, sometimes the same sequence appears on the array more than once, and this is the identifier used for the probe; we add something like "___[string]" to the end of probe name in this case. For example, a sequence with name M100000439 will match probes named M100000439 as well as M100000439___Dup1.
        3. The BioSequences for the CompositeSequences are already filled in, and there is a matching external database identifier (e.g., Genbank accession). This will only work if Genbank accessions do not re-occur in the FASTA file.
        Specified by:
        processArrayDesign in interface ArrayDesignSequenceProcessingService
        Parameters:
        arrayDesign - platform
        sequenceFile - FASTA, Affymetrix or tabbed format (depending on the type)
        sequenceType - e.g., SequenceType.DNA (generic), SequenceType.AFFY_PROBE, or SequenceType.OLIGO.
        taxon - if null, attempt to determine it from the array design.
        Returns:
        bio sequences
        Throws:
        IOException - when IO problems occur.
        See Also:
        FastaParser
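        The three matching tests listed for this method can be sketched as a simple cascade. This is an illustration only, not the actual implementation: the class, method and parameter names are hypothetical, sequences are represented as plain strings, and the "___" separator is assumed from the example above.

```java
import java.util.Map;
import java.util.Optional;

final class FastaMatchSketch {
    // Assumed separator, taken from the "M100000439___Dup1" example.
    static final String SEPARATOR = "___";

    /**
     * @param probeName      name of the design element (CompositeSequence)
     * @param probeAccession accession already attached to the probe's sequence, or null
     * @param byName         FASTA sequences keyed by the name found on the defline
     * @param byAccession    FASTA sequences keyed by external database accession
     * @return the matched sequence, or empty if no test passes
     */
    static Optional<String> match(String probeName, String probeAccession,
                                  Map<String, String> byName,
                                  Map<String, String> byAccession) {
        // Test 1: the defline names the probe explicitly.
        String seq = byName.get(probeName);
        if (seq != null) return Optional.of(seq);

        // Test 2: the probe name is a sequence name plus a duplicate suffix.
        int idx = probeName.indexOf(SEPARATOR);
        if (idx > 0) {
            seq = byName.get(probeName.substring(0, idx));
            if (seq != null) return Optional.of(seq);
        }

        // Test 3: fall back to a matching external database identifier.
        if (probeAccession != null) {
            seq = byAccession.get(probeAccession);
            if (seq != null) return Optional.of(seq);
        }
        return Optional.empty();
    }
}
```

        An empty result here corresponds to the case described above where a probe has no match in the input file and its sequence is nulled.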
      • processArrayDesign

        public Collection<BioSequence> processArrayDesign​(ArrayDesign arrayDesign,
                                                          InputStream sequenceIdentifierFile,
                                                          String[] databaseNames,
                                                          String blastDbHome,
                                                          Taxon taxon,
                                                          boolean force)
                                                   throws IOException
        Description copied from interface: ArrayDesignSequenceProcessingService
        Intended for use with array designs that use sequences that are in Genbank, but where the accessions need to be assigned after the array is already in the system. This happens when only partial or incorrect information is in GEO, for example, when Refseq ids are provided instead of the EST clone that was arrayed. This method ALWAYS clobbers the BioSequence associations that are associated with the array design (at least, if any of the probe identifiers in the file given match the array design).
        Specified by:
        processArrayDesign in interface ArrayDesignSequenceProcessingService
        Parameters:
        arrayDesign - platform
        sequenceIdentifierFile - Sequence file has two columns: column 1 is a probe id, column 2 is a Genbank accession or sequence name, delimited by tab. Sequences will be fetched from BLAST databases if possible; ones missing will be sought directly in Gemma.
        databaseNames - the names of the BLAST-formatted databases to search (e.g., nt, est_mouse)
        blastDbHome - where to find the blast databases for sequence retrieval
        taxon - taxon
        force - If true, if an existing BioSequence that matches is found in the system, any existing sequence information in the BioSequence will be overwritten.
        Returns:
        bio sequences
        Throws:
        IOException - when IO problems occur.
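        To make the expected layout of sequenceIdentifierFile concrete, here is a small sketch that reads a two-column, tab-delimited stream into a probe id to accession map. The class name, the UTF-8 encoding and the choice to skip blank or malformed lines are assumptions for illustration; the real parser may behave differently.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

final class ProbeAccessionFileSketch {
    /** Column 1 is the probe id, column 2 the accession or sequence name. */
    static Map<String, String> parse(InputStream in) throws IOException {
        Map<String, String> result = new LinkedHashMap<>();
        try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.trim().isEmpty()) continue;   // skip blank lines (assumption)
                String[] fields = line.split("\t");
                if (fields.length < 2) continue;       // skip malformed lines (assumption)
                result.put(fields[0].trim(), fields[1].trim());
            }
        }
        return result;
    }
}
```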
      • processArrayDesign

        public Collection<BioSequence> processArrayDesign​(ArrayDesign arrayDesign,
                                                          String[] databaseNames,
                                                          String blastDbHome,
                                                          boolean force)
        Description copied from interface: ArrayDesignSequenceProcessingService
        For the case where the sequences are retrieved simply by the Genbank accession. For this to work, the array design must already have the BioSequence objects, but they haven't been populated with the actual sequences (if they have, the values will be replaced if force=true). Sequences that appear to be IMAGE clones are given another check, and the Genbank accession used to retrieve the sequence is based on that, not the one provided in the BioSequence; if it differs it will be replaced. This happens when the Genbank accession is for a Refseq (for example) but the actual clone on the array is from IMAGE.
        Specified by:
        processArrayDesign in interface ArrayDesignSequenceProcessingService
        Parameters:
        arrayDesign - platform
        databaseNames - the names of the BLAST-formatted databases to search (e.g., nt, est_mouse)
        blastDbHome - where to find the blast databases for sequence retrieval
        force - If true, then when an existing BioSequence contains a non-empty sequence value, it will be overwritten with a new one.
        Returns:
        bio sequences
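        The effect of the force flag can be summarized with a small decision function. This is a sketch of the documented rule only; the names are hypothetical, sequences are plain strings, and keeping the existing value when nothing was retrieved is an assumption.

```java
final class ForceOverwriteSketch {
    /**
     * @param existing sequence currently stored on the BioSequence, may be null or empty
     * @param fetched  sequence retrieved from the BLAST databases, or null if none was found
     * @param force    whether a non-empty existing sequence may be overwritten
     * @return the sequence value to store
     */
    static String resolve(String existing, String fetched, boolean force) {
        if (fetched == null) return existing;  // nothing retrieved: keep what we have (assumption)
        boolean hasExisting = existing != null && !existing.isEmpty();
        // A non-empty existing value is only replaced when force is true.
        return (hasExisting && !force) ? existing : fetched;
    }
}
```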
      • processArrayDesign

        public Collection<BioSequence> processArrayDesign​(ArrayDesign arrayDesign,
                                                          String[] databaseNames,
                                                          String blastDbHome,
                                                          boolean force,
                                                          FastaCmd fc)
        Description copied from interface: ArrayDesignSequenceProcessingService
        Provided primarily for testing.
        Specified by:
        processArrayDesign in interface ArrayDesignSequenceProcessingService
        Parameters:
        arrayDesign - platform
        databaseNames - the names of the BLAST-formatted databases to search (e.g., nt, est_mouse)
        blastDbHome - where to find the blast databases for sequence retrieval
        force - If true, then when an existing BioSequence contains a non-empty sequence value, it will be overwritten with a new one.
        fc - fasta command
        Returns:
        bio sequences
      • processSingleAccession

        public BioSequence processSingleAccession​(String sequenceId,
                                                  String[] databaseNames,
                                                  String blastDbHome,
                                                  boolean force)
        Update a single sequence in the system.
        Specified by:
        processSingleAccession in interface ArrayDesignSequenceProcessingService
        Parameters:
        force - If true, if an existing BioSequence that matches is found in the system, any existing sequence information in the BioSequence will be overwritten.
        sequenceId - sequence id
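        databaseNames - the names of the BLAST-formatted databases to search (e.g., nt, est_mouse)
        blastDbHome - where to find the blast databases for sequence retrieval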
        Returns:
        persistent BioSequence.
      • validateTaxon

        public Taxon validateTaxon​(Taxon taxon,
                                   ArrayDesign arrayDesign)
                            throws IllegalArgumentException
        If taxon is null, it has not been provided on the command line, so the taxon is deduced from the arrayDesign. If the array design has zero taxa or more than one taxon, an error is thrown, because this programme can only be run for one taxon at a time when processing from a file.
        Specified by:
        validateTaxon in interface ArrayDesignSequenceProcessingService
        Parameters:
        taxon - Taxon as passed in on the command line
        arrayDesign - Array design to process
        Returns:
        Taxon to process
        Throws:
        IllegalArgumentException - Thrown when there is not exactly 1 taxon.
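        The rule described for validateTaxon can be sketched generically as follows. The class name and the taxaOnDesign parameter are hypothetical stand-ins for however the array design exposes its taxa, and the sketch assumes the "exactly one taxon" check only applies when the taxon has to be deduced.

```java
import java.util.Collection;

final class ValidateTaxonSketch {
    /**
     * @param taxon        taxon supplied by the caller, or null if none was given
     * @param taxaOnDesign taxa found on the array design (hypothetical accessor result)
     * @return the supplied taxon, or the single taxon on the design
     * @throws IllegalArgumentException if the taxon must be deduced and there is not exactly one
     */
    static <T> T validateTaxon(T taxon, Collection<T> taxaOnDesign) {
        if (taxon != null) return taxon;  // explicitly provided: use as-is (assumption)
        if (taxaOnDesign.size() != 1) {
            throw new IllegalArgumentException(
                    "Expected exactly 1 taxon on the array design, found " + taxaOnDesign.size());
        }
        return taxaOnDesign.iterator().next();
    }
}
```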