Interface ArrayDesignSequenceProcessingService

    • Method Detail

      • assignSequencesToDesignElements

        void assignSequencesToDesignElements​(Collection<CompositeSequence> designElements,
                                             Collection<BioSequence> sequences)
        Associate sequences with an array design.
        Parameters:
        sequences - the sequences to assign; for Affymetrix, these should be the collapsed probe sequences.
        designElements - design elements
      • assignSequencesToDesignElements

        void assignSequencesToDesignElements​(Collection<CompositeSequence> designElements,
                                             File fastaFile)
                                      throws IOException
        Associate sequences with an array design. It is assumed that the name of each sequence can be matched to the name of a design element.
        Parameters:
        designElements - design elements
        fastaFile - fasta file
        Throws:
        IOException - when IO problems occur.
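        The name-based matching assumed above can be pictured with a small, self-contained sketch. It is not Gemma's implementation: the class FastaNameMatchSketch, its readFasta helper, and the rule that the sequence name is the first whitespace-delimited token of the defline are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a hypothetical helper, not part of the Gemma API.
final class FastaNameMatchSketch {

    /**
     * Collects FASTA records into a name -> sequence map.
     * Assumption: the name is the first whitespace-delimited token after '>'.
     */
    static Map<String, String> readFasta(List<String> lines) {
        Map<String, StringBuilder> records = new LinkedHashMap<>();
        StringBuilder current = null;
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue; // ignore blank lines
            }
            if (line.startsWith(">")) {
                String name = line.substring(1).trim().split("\\s+")[0];
                current = new StringBuilder();
                records.put(name, current);
            } else if (current != null) {
                current.append(line); // a sequence may span several lines
            }
        }
        Map<String, String> result = new LinkedHashMap<>();
        records.forEach((name, seq) -> result.put(name, seq.toString()));
        return result;
    }

    public static void main(String[] args) {
        List<String> fasta = Arrays.asList(">probe_1 example", "ACGT", "TTGA", ">probe_2", "GGCC");
        Map<String, String> byName = readFasta(fasta);
        // Design element names are looked up directly by sequence name.
        for (String designElementName : Arrays.asList("probe_1", "probe_3")) {
            String sequence = byName.get(designElementName);
            System.out.println(designElementName + " -> " + (sequence == null ? "no match" : sequence));
        }
    }
}
```

        In this sketch a design element whose name has no counterpart in the FASTA input simply gets no sequence; how the real service handles that case is not stated for this method.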
      • processAffymetrixDesign

        Collection<BioSequence> processAffymetrixDesign​(ArrayDesign arrayDesign,
                                                        InputStream probeSequenceFile,
                                                        Taxon taxon)
                                                 throws IOException
        Use this to add sequences to an existing Affymetrix design. The sequences will be overwritten even if they already exist (that is, this covers the case where the actual ATGCs need to be replaced but the BioSequences are already filled in). Note that probe sets are often shared by platforms; rather than creating duplicates for each, we keep a single copy. This is considered safe because Affymetrix uses unique probe set names for a given set of actual probe sequences.
        Parameters:
        arrayDesign - An existing ArrayDesign that already has compositeSequences filled in.
        probeSequenceFile - InputStream from a tab-delimited probe sequence file.
        taxon - validated taxon
        Returns:
        bio sequences
        Throws:
        IOException - when IO problems occur.
      • processArrayDesign

        Collection<BioSequence> processArrayDesign​(ArrayDesign arrayDesign,
                                                   InputStream sequenceFile,
                                                   SequenceType sequenceType)
                                            throws IOException
        The sequence file must provide an unambiguous way to associate the sequences with design elements on the array. If the SequenceType is AFFY_PROBE, the sequences will be treated as probes in probe sets, in Affymetrix 'tabbed' format. Otherwise the format of the file is assumed to be FASTA, with one CompositeSequence per FASTA element; there is further assumed to be just one Reporter per CompositeSequence (that is, they are the same thing). The FASTA file must use a standard defline format (as described here). For FASTA files, the match-up of the sequence with the design element is done using the following tests, until one passes:
        1. The format line contains an explicit reference to the name of the CompositeSequence (probe id).
        2. The BioSequences for the CompositeSequences are already filled in, and there is a matching external database identifier (e.g., GenBank accession). This will only work if GenBank accessions do not recur in the FASTA file.
        Parameters:
        sequenceFile - FASTA format, or Affymetrix 'tabbed' format when the sequenceType is AFFY_PROBE
        sequenceType - e.g., SequenceType.DNA (generic), SequenceType.AFFY_PROBE, or SequenceType.OLIGO.
        arrayDesign - platform
        Returns:
        bio sequences
        Throws:
        IOException - when IO problems occur.
        See Also:
        FastaParser
      • processArrayDesign

        Collection<BioSequence> processArrayDesign​(ArrayDesign arrayDesign,
                                                   InputStream sequenceFile,
                                                   SequenceType sequenceType,
                                                   Taxon taxon)
                                            throws IOException
        The sequence file must provide an unambiguous way to associate the sequences with design elements on the array. If a probe does not have a match to a sequence in the input file, the sequence for that probe will be nulled. If the SequenceType is AFFY_PROBE, the sequences will be treated as probes in probe sets, in Affymetrix 'tabbed' format. If the SequenceType is OLIGO, the input is treated as a table (see ProbeSequenceParser; to retain semi-backwards compatibility, FASTA is detected but an exception will be thrown). Otherwise the format of the file is assumed to be FASTA, with one CompositeSequence per FASTA element; there is further assumed to be just one Reporter per CompositeSequence (that is, they are the same thing). The FASTA file must use a standard defline format (as described here). For FASTA files, the match-up of the sequence with the design element is done using the following tests, until one passes:
        1. The format line contains an explicit reference to the name of the CompositeSequence (probe id).
        2. The format line sequence name matches the CompositeSequence name with a suffix added to disambiguate duplicates. That is, sometimes the same sequence appears on the array more than once, and this is the identifier used for the probe; we add something like "___[string]" to the end of the probe name in this case. For example, a sequence with name M100000439 will match probes named M100000439 as well as M100000439___Dup1.
        3. The BioSequences for the CompositeSequences are already filled in, and there is a matching external database identifier (e.g., GenBank accession). This will only work if GenBank accessions do not recur in the FASTA file.
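        The first two tests above can be sketched in a few lines. This is a hedged illustration rather than the service's actual code: the class ProbeNameMatchSketch and its methods are hypothetical, the "___" separator is taken from the example above, and the third test (external database identifiers) is not modelled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: hypothetical names, not part of the Gemma API.
final class ProbeNameMatchSketch {

    // Separator assumed from the "___[string]" convention described in the text.
    static final String DUPLICATE_SEPARATOR = "___";

    /** True if the probe name equals the sequence name, or equals it plus "___" and a non-empty suffix. */
    static boolean matches(String sequenceName, String probeName) {
        if (probeName.equals(sequenceName)) {
            return true; // test 1: exact name match
        }
        String prefix = sequenceName + DUPLICATE_SEPARATOR;
        // test 2: same name with a disambiguating suffix
        return probeName.startsWith(prefix) && probeName.length() > prefix.length();
    }

    /** Returns every probe name that the given sequence name matches. */
    static List<String> probesFor(String sequenceName, List<String> probeNames) {
        List<String> matched = new ArrayList<>();
        for (String probeName : probeNames) {
            if (matches(sequenceName, probeName)) {
                matched.add(probeName);
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        List<String> probes = Arrays.asList("M100000439", "M100000439___Dup1", "M1000004390");
        // prints [M100000439, M100000439___Dup1]
        System.out.println(probesFor("M100000439", probes));
    }
}
```

        Requiring the separator before the suffix keeps a probe such as M1000004390 from matching merely because it shares a prefix with the sequence name.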
        Parameters:
        sequenceFile - FASTA, Affymetrix or tabbed format (depending on the type)
        sequenceType - e.g., SequenceType.DNA (generic), SequenceType.AFFY_PROBE, or SequenceType.OLIGO.
        taxon - if null, attempt to determine it from the array design.
        arrayDesign - platform
        Returns:
        bio sequences
        Throws:
        IOException - when IO problems occur.
        See Also:
        FastaParser
      • processArrayDesign

        Collection<BioSequence> processArrayDesign​(ArrayDesign arrayDesign,
                                                   InputStream sequenceFile,
                                                   InputStream sequenceIdentifierFile,
                                                   SequenceType sequenceType,
                                                   Taxon taxon)
                                            throws IOException
        Read from a FASTA file when the sequence file lacks any way to link the sequences back to the probes. Provide the sequenceIdentifierFile to supply that link.
        Parameters:
        arrayDesign - platform
        sequenceFile - FASTA
        sequenceIdentifierFile - two columns of probe ids and sequence ids (the same ones used in the sequenceFile)
        taxon - if null, attempt to determine it from the array design
        Returns:
        bio sequences
        Throws:
        IOException - when IO problems occur.
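        To make the role of the identifier file concrete, here is a minimal sketch of reading a two-column mapping from probe ids to sequence ids. It is an illustration under assumptions: the class SequenceIdentifierFileSketch is hypothetical, the columns are assumed to be tab-delimited (the delimiter is not stated for this method), and malformed lines are simply skipped.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a hypothetical parser, not part of the Gemma API.
final class SequenceIdentifierFileSketch {

    /**
     * Builds a probe id -> sequence id map from the lines of a two-column file.
     * Assumption: columns are separated by a tab; lines without two non-empty columns are skipped.
     */
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> probeToSequenceId = new LinkedHashMap<>();
        for (String line : lines) {
            String[] fields = line.split("\t");
            if (fields.length < 2) {
                continue; // blank or malformed line
            }
            String probeId = fields[0].trim();
            String sequenceId = fields[1].trim();
            if (probeId.isEmpty() || sequenceId.isEmpty()) {
                continue;
            }
            probeToSequenceId.put(probeId, sequenceId);
        }
        return probeToSequenceId;
    }

    public static void main(String[] args) {
        // Hypothetical probe and sequence ids, for demonstration only.
        List<String> lines = Arrays.asList("probe_1\tseq_A", "probe_2\tseq_B", "", "malformed");
        Map<String, String> mapping = parse(lines);
        // The sequence id found here is then looked up among the FASTA record names.
        System.out.println(mapping);
    }
}
```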
      • processArrayDesign

        Collection<BioSequence> processArrayDesign​(ArrayDesign arrayDesign,
                                                   InputStream sequenceIdentifierFile,
                                                   String[] databaseNames,
                                                   String blastDbHome,
                                                   Taxon taxon,
                                                   boolean force)
                                            throws IOException
        Intended for use with array designs whose sequences are in GenBank, but whose accessions need to be assigned after the array is already in the system. This happens when only partial or incorrect information is in GEO, for example, when RefSeq ids are provided instead of the EST clone that was arrayed. This method ALWAYS clobbers the BioSequence associations of the array design (at least, if any of the probe identifiers in the given file match the array design).
        Parameters:
        sequenceIdentifierFile - A tab-delimited file with two columns: column 1 is a probe id, column 2 is a GenBank accession or sequence name. Sequences will be fetched from BLAST databases if possible; ones missing will be sought directly in Gemma.
        force - If true, and a matching existing BioSequence is found in the system, any existing sequence information in that BioSequence will be overwritten.
        arrayDesign - platform
        taxon - taxon
        blastDbHome - blast db home
        databaseNames - database names
        Returns:
        bio sequences
        Throws:
        IOException - when IO problems occur.
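        The lookup order described for sequenceIdentifierFile (BLAST databases first, then Gemma itself) can be sketched as a simple fallback. This is only an illustration: the class SequenceLookupFallbackSketch is hypothetical, and plain maps stand in for the BLAST databases and for Gemma's own store.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Illustrative only: maps stand in for the real BLAST databases and for Gemma's store.
final class SequenceLookupFallbackSketch {

    /** Looks in the BLAST stand-in first, then falls back to the local stand-in. */
    static Optional<String> find(String accession, Map<String, String> blastDb, Map<String, String> localStore) {
        String sequence = blastDb.get(accession);
        if (sequence == null) {
            sequence = localStore.get(accession); // missing from BLAST: try the local store
        }
        return Optional.ofNullable(sequence);
    }

    public static void main(String[] args) {
        // Hypothetical accessions and sequences, for demonstration only.
        Map<String, String> blastDb = new HashMap<>();
        blastDb.put("ACC_1", "ACGT");
        Map<String, String> localStore = new HashMap<>();
        localStore.put("ACC_2", "GGCC");

        System.out.println(find("ACC_1", blastDb, localStore).orElse("not found"));
        System.out.println(find("ACC_2", blastDb, localStore).orElse("not found"));
        System.out.println(find("ACC_3", blastDb, localStore).orElse("not found"));
    }
}
```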
      • processArrayDesign

        Collection<BioSequence> processArrayDesign​(ArrayDesign arrayDesign,
                                                   String[] databaseNames,
                                                   String blastDbHome,
                                                   boolean force)
        For the case where the sequences are retrieved simply by the GenBank accession. For this to work, the array design must already have the BioSequence objects, but they have not yet been populated with the actual sequences (if they have, the values will be replaced if force=true). Sequences that appear to be IMAGE clones are given another check, and the GenBank accession used to retrieve the sequence is based on that check, not on the accession provided in the BioSequence; if it differs, it will be replaced. This happens when the GenBank accession is for a RefSeq (for example) but the actual clone on the array is from IMAGE.
        Parameters:
        databaseNames - the names of the BLAST-formatted databases to search (e.g., nt, est_mouse)
        blastDbHome - where to find the blast databases for sequence retrieval
        force - If true, then when an existing BioSequence contains a non-empty sequence value, it will be overwritten with a new one.
        arrayDesign - platform
        Returns:
        bio sequences
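        The effect of the force flag can be summarised with a small decision function. This is a sketch of the documented behaviour, not the service's code: the class ForceOverwriteSketch is hypothetical, and keeping the existing value when nothing was retrieved is an assumption.

```java
// Illustrative only: a hypothetical helper, not part of the Gemma API.
final class ForceOverwriteSketch {

    /**
     * Decides which sequence value to keep.
     * A non-empty existing value is replaced only when force is true.
     * Assumption: if nothing was retrieved, the existing value is left alone.
     */
    static String resolve(String existing, String retrieved, boolean force) {
        if (retrieved == null) {
            return existing;
        }
        boolean hasExisting = existing != null && !existing.isEmpty();
        if (hasExisting && !force) {
            return existing; // keep what is already there
        }
        return retrieved; // fill in an empty value, or overwrite because force is true
    }

    public static void main(String[] args) {
        System.out.println(resolve("AAAA", "CCCC", false)); // prints AAAA
        System.out.println(resolve("AAAA", "CCCC", true));  // prints CCCC
        System.out.println(resolve("", "CCCC", false));     // prints CCCC
    }
}
```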
      • processArrayDesign

        Collection<BioSequence> processArrayDesign​(ArrayDesign arrayDesign,
                                                   String[] databaseNames,
                                                   String blastDbHome,
                                                   boolean force,
                                                   FastaCmd fc)
        Provided primarily for testing.
        Parameters:
        databaseNames - the names of the BLAST-formatted databases to search (e.g., nt, est_mouse)
        blastDbHome - where to find the blast databases for sequence retrieval
        force - If true, then when an existing BioSequence contains a non-empty sequence value, it will be overwritten with a new one.
        arrayDesign - platform
        fc - fasta command
        Returns:
        bio sequences
      • processSingleAccession

        BioSequence processSingleAccession​(String sequenceId,
                                           String[] databaseNames,
                                           String blastDbHome,
                                           boolean force)
        Update a single sequence in the system.
        Parameters:
        force - If true, and a matching existing BioSequence is found in the system, any existing sequence information in that BioSequence will be overwritten.
        databaseNames - database names
        blastDbHome - blast db home
        sequenceId - sequence id
        Returns:
        persistent BioSequence.