Interface ArrayDesignSequenceProcessingService
- All Known Implementing Classes:
ArrayDesignSequenceProcessingServiceImpl
public interface ArrayDesignSequenceProcessingService
- Author:
paul
-
Method Summary
Modifier and Type, Method, Description

void assignSequencesToDesignElements(Collection<CompositeSequence> designElements, File fastaFile)
    Associate sequences with an array design.

void assignSequencesToDesignElements(Collection<CompositeSequence> designElements, InputStream fastaFile)

void assignSequencesToDesignElements(Collection<CompositeSequence> designElements, Collection<BioSequence> sequences)
    Associate sequences with an array design.

Collection<BioSequence> processAffymetrixDesign(ArrayDesign arrayDesign, InputStream probeSequenceFile, Taxon taxon)
    Use this to add sequences to an existing Affymetrix design.

Collection<BioSequence> processArrayDesign(ArrayDesign arrayDesign, InputStream sequenceFile, InputStream sequenceIdentifierFile, SequenceType sequenceType, Taxon taxon)
    Read from FASTA file when the sequence file lacks any way to link the sequences back to the probes.

Collection<BioSequence> processArrayDesign(ArrayDesign arrayDesign, InputStream sequenceIdentifierFile, String[] databaseNames, String blastDbHome, Taxon taxon, boolean force)
    Intended for use with array designs that use sequences that are in Genbank, but the accessions need to be assigned after the array is already in the system.

Collection<BioSequence> processArrayDesign(ArrayDesign arrayDesign, InputStream sequenceIdentifierFile, String[] databaseNames, String blastDbHome, Taxon taxon, boolean force, FastaCmd fc)

Collection<BioSequence> processArrayDesign(ArrayDesign arrayDesign, InputStream sequenceFile, SequenceType sequenceType)
    The sequence file must provide an unambiguous way to associate the sequences with design elements on the array.

Collection<BioSequence> processArrayDesign(ArrayDesign arrayDesign, InputStream sequenceFile, SequenceType sequenceType, Taxon taxon)
    The sequence file must provide an unambiguous way to associate the sequences with design elements on the array.

Collection<BioSequence> processArrayDesign(ArrayDesign arrayDesign, String[] databaseNames, boolean force)

Collection<BioSequence> processArrayDesign(ArrayDesign arrayDesign, String[] databaseNames, String blastDbHome, boolean force)
    For the case where the sequences are retrieved simply by the Genbank accession.

Collection<BioSequence> processArrayDesign(ArrayDesign arrayDesign, String[] databaseNames, String blastDbHome, boolean force, FastaCmd fc)
    Provided primarily for testing.

BioSequence processSingleAccession(String sequenceId, String[] databaseNames, String blastDbHome, boolean force)
    Update a single sequence in the system.

Taxon validateTaxon(Taxon taxon, ArrayDesign arrayDesign)
-
Method Detail
-
assignSequencesToDesignElements
void assignSequencesToDesignElements(Collection<CompositeSequence> designElements, Collection<BioSequence> sequences)
Associate sequences with an array design.
- Parameters:
sequences - the sequences to associate; for Affymetrix these should be the collapsed probe sequences.
designElements - design elements
-
assignSequencesToDesignElements
void assignSequencesToDesignElements(Collection<CompositeSequence> designElements, File fastaFile) throws IOException
Associate sequences with an array design. It is assumed that the name of each sequence can be matched to the name of a design element.
- Parameters:
designElements - design elements
fastaFile - FASTA file
- Throws:
IOException - when IO problems occur.
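
The name-based matching assumed above can be illustrated with a minimal sketch. It is not the Gemma implementation: the class `FastaNameIndexSketch` and its method are hypothetical, and it assumes that a sequence's name is the first whitespace-delimited token of the FASTA header line. It reads FASTA text into a map keyed by sequence name, so each design element name can then be looked up directly.

```java
import java.io.BufferedReader;
import java.io.Reader;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper for illustration only; not part of the Gemma API.
class FastaNameIndexSketch {

    // Reads FASTA records into a map of sequence name -> residues, in file order.
    static Map<String, String> readFasta(Reader in) {
        Map<String, String> byName = new LinkedHashMap<>();
        List<String> lines = new BufferedReader(in).lines().collect(Collectors.toList());
        String name = null;
        StringBuilder residues = new StringBuilder();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue;
            }
            if (line.startsWith(">")) {
                // A new header closes the previous record, if there was one.
                if (name != null) {
                    byName.put(name, residues.toString());
                }
                // Assumption: the name is the first token after '>'.
                name = line.substring(1).trim().split("\\s+")[0];
                residues.setLength(0);
            } else if (name != null) {
                // Sequence lines may be wrapped; concatenate them.
                residues.append(line);
            }
        }
        if (name != null) {
            byName.put(name, residues.toString());
        }
        return byName;
    }
}
```

With such a map, a design element named `probe1` would be matched by calling `get("probe1")`; a `null` result means the file has no sequence with that name.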
-
assignSequencesToDesignElements
void assignSequencesToDesignElements(Collection<CompositeSequence> designElements, InputStream fastaFile) throws IOException
- Throws:
IOException
-
processAffymetrixDesign
Collection<BioSequence> processAffymetrixDesign(ArrayDesign arrayDesign, InputStream probeSequenceFile, Taxon taxon) throws IOException
Use this to add sequences to an existing Affymetrix design. The sequences will be overwritten even if they already exist (that is, if the actual ATGCs need to be replaced, but the BioSequences are already filled in). Note that probe sets are often shared by platforms; rather than creating duplicates for each, we keep a single copy. This is considered safe because Affymetrix uses unique probeset names for a given set of actual probe sequences.
- Parameters:
arrayDesign - An existing ArrayDesign that already has compositeSequences filled in.
probeSequenceFile - InputStream from a tab-delimited probe sequence file.
taxon - validated taxon
- Returns:
bio sequences
- Throws:
IOException - when IO problems occur.
-
processArrayDesign
Collection<BioSequence> processArrayDesign(ArrayDesign arrayDesign, InputStream sequenceFile, SequenceType sequenceType) throws IOException
The sequence file must provide an unambiguous way to associate the sequences with design elements on the array. If the SequenceType is AFFY_PROBE, the sequences will be treated as probes in probe sets, in Affymetrix 'tabbed' format. Otherwise the format of the file is assumed to be FASTA, with one CompositeSequence per FASTA element; there is further assumed to be just one Reporter per CompositeSequence (that is, they are the same thing). The FASTA file must use a standard defline format (as described here). For FASTA files, the match-up of the sequence with the design element is done using the following tests, until one passes:
- The format line contains an explicit reference to the name of the CompositeSequence (probe id).
- The BioSequences for the CompositeSequences are already filled in, and there is a matching external database identifier (e.g., Genbank accession). This will only work if Genbank accessions do not re-occur in the FASTA file.
- Parameters:
sequenceFile - FASTA format
sequenceType - e.g., SequenceType.DNA (generic), SequenceType.AFFY_PROBE, or SequenceType.OLIGO.
arrayDesign - platform
- Returns:
bio sequences
- Throws:
IOException - when IO problems occur.
- See Also:
FastaParser
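
The accession-based test described for this method only works when no Genbank accession re-occurs in the FASTA file. A caller could check that precondition before processing. The following is a minimal sketch under that assumption; the class and method names are hypothetical and not part of Gemma, and the accessions are assumed to have been extracted from the file already.

```java
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration only; not part of the Gemma API.
class DuplicateAccessionSketch {

    // Returns the accessions that occur more than once in the given list.
    static Set<String> findRepeated(List<String> accessions) {
        Set<String> seen = new HashSet<>();
        Set<String> repeated = new LinkedHashSet<>();
        for (String accession : accessions) {
            // Set.add returns false when the element was already present.
            if (!seen.add(accession)) {
                repeated.add(accession);
            }
        }
        return repeated;
    }
}
```

An empty result means every accession is unique, so matching by accession would be unambiguous for that file.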
-
processArrayDesign
Collection<BioSequence> processArrayDesign(ArrayDesign arrayDesign, InputStream sequenceFile, SequenceType sequenceType, Taxon taxon) throws IOException
The sequence file must provide an unambiguous way to associate the sequences with design elements on the array. If a probe does not have a match to a sequence in the input file, the sequence for that probe will be nulled. If the SequenceType is AFFY_PROBE, the sequences will be treated as probes in probe sets, in Affymetrix 'tabbed' format. If the SequenceType is OLIGO, the input is treated as a table (see ProbeSequenceParser; to retain semi-backwards compatibility, FASTA is detected but an exception will be thrown). Otherwise the format of the file is assumed to be FASTA, with one CompositeSequence per FASTA element; there is further assumed to be just one Reporter per CompositeSequence (that is, they are the same thing). The FASTA file must use a standard defline format (as described here). For FASTA files, the match-up of the sequence with the design element is done using the following tests, until one passes:
- The format line contains an explicit reference to the name of the CompositeSequence (probe id).
- The format line sequence name matches the CompositeSequence name with a suffix added to disambiguate duplicates. That is, sometimes the same sequence appears on the array more than once, and this is the identifier used for the probe; we add something like "___[string]" to the end of the probe name in this case. For example, a sequence with name M100000439 will match probes named M100000439 as well as M100000439___Dup1.
- The BioSequences for the CompositeSequences are already filled in, and there is a matching external database identifier (e.g., Genbank accession). This will only work if Genbank accessions do not re-occur in the FASTA file.
- Parameters:
sequenceFile - FASTA, Affymetrix or tabbed format (depending on the type)
sequenceType - e.g., SequenceType.DNA (generic), SequenceType.AFFY_PROBE, or SequenceType.OLIGO.
taxon - if null, attempt to determine it from the array design.
arrayDesign - platform
- Returns:
bio sequences
- Throws:
IOException - when IO problems occur.
- See Also:
FastaParser
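
The duplicate-suffix rule described for this method can be sketched as follows. This is an illustration, not the Gemma implementation: the class and method names are hypothetical, and it assumes the suffix always begins at the first occurrence of "___" in the probe name. Probe names are indexed by their base name, so one sequence name can match several probes.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not part of the Gemma API.
class ProbeSuffixMatchSketch {

    // Marker taken from the description above ("___[string]").
    static final String DUPLICATE_MARKER = "___";

    // Strips the disambiguating suffix, if any (assumption: first "___" starts it).
    static String baseName(String probeName) {
        int idx = probeName.indexOf(DUPLICATE_MARKER);
        return idx < 0 ? probeName : probeName.substring(0, idx);
    }

    // Groups probe names by base name, preserving input order within each group.
    static Map<String, List<String>> indexByBaseName(Collection<String> probeNames) {
        Map<String, List<String>> index = new HashMap<>();
        for (String probeName : probeNames) {
            index.computeIfAbsent(baseName(probeName), k -> new ArrayList<>()).add(probeName);
        }
        return index;
    }

    // Returns every probe that the given sequence name matches, or an empty list.
    static List<String> probesFor(String sequenceName, Map<String, List<String>> index) {
        return index.getOrDefault(sequenceName, Collections.emptyList());
    }
}
```

For the example in the description, looking up the sequence name `M100000439` returns both `M100000439` and `M100000439___Dup1`.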
-
processArrayDesign
Collection<BioSequence> processArrayDesign(ArrayDesign arrayDesign, InputStream sequenceFile, InputStream sequenceIdentifierFile, SequenceType sequenceType, Taxon taxon) throws IOException
Read from FASTA file when the sequence file lacks any way to link the sequences back to the probes. Provide the sequenceIdentifierFile to do so.
- Parameters:
arrayDesign - platform
sequenceFile - FASTA
sequenceIdentifierFile - two columns of probe ids and sequence IDs (the same ones in the sequenceFile)
taxon - if null, attempt to determine it from the array design
- Returns:
bio sequences
- Throws:
IOException
-
processArrayDesign
Collection<BioSequence> processArrayDesign(ArrayDesign arrayDesign, InputStream sequenceIdentifierFile, String[] databaseNames, String blastDbHome, Taxon taxon, boolean force) throws IOException
Intended for use with array designs that use sequences that are in Genbank, but the accessions need to be assigned after the array is already in the system. This happens when only partial or incorrect information is in GEO, for example, when Refseq ids are provided instead of the EST clone that was arrayed. This method ALWAYS clobbers the BioSequence associations of the array design (at least, if any of the probe identifiers in the given file match the array design).
- Parameters:
sequenceIdentifierFile - Sequence file has two columns: column 1 is a probe id, column 2 is a Genbank accession or sequence name, delimited by tab. Sequences will be fetched from BLAST databases if possible; ones missing will be sought directly in Gemma.
force - If true, and an existing BioSequence that matches is found in the system, any existing sequence information in the BioSequence will be overwritten.
arrayDesign - platform
taxon - taxon
blastDbHome - blast db home
databaseNames - database names
- Returns:
bio sequences
- Throws:
IOException - when IO problems occur.
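
The two-column, tab-delimited identifier file described for this method can be read with a few lines of code. The sketch below is illustrative only: the class and method names are hypothetical, and skipping blank or malformed rows is an assumption rather than documented behavior of Gemma.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not part of the Gemma API.
class ProbeAccessionTableSketch {

    // Maps probe id (column 1) to accession or sequence name (column 2).
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> probeToAccession = new LinkedHashMap<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue;
            }
            String[] fields = line.split("\t");
            // Assumption: rows without two non-empty columns are skipped.
            if (fields.length < 2 || fields[0].trim().isEmpty() || fields[1].trim().isEmpty()) {
                continue;
            }
            probeToAccession.put(fields[0].trim(), fields[1].trim());
        }
        return probeToAccession;
    }
}
```

Each entry of the resulting map names the sequence that would be fetched for one probe; how the sequences are then retrieved from the BLAST databases is outside the scope of this sketch.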
-
processArrayDesign
Collection<BioSequence> processArrayDesign(ArrayDesign arrayDesign, InputStream sequenceIdentifierFile, String[] databaseNames, String blastDbHome, Taxon taxon, boolean force, FastaCmd fc) throws IOException
- Throws:
IOException
-
processArrayDesign
Collection<BioSequence> processArrayDesign(ArrayDesign arrayDesign, String[] databaseNames, boolean force)
-
processArrayDesign
Collection<BioSequence> processArrayDesign(ArrayDesign arrayDesign, String[] databaseNames, String blastDbHome, boolean force)
For the case where the sequences are retrieved simply by the Genbank accession. For this to work, the array design must already have the BioSequence objects, but they have not yet been populated with the actual sequences (if they have, the values will be replaced if force=true). Sequences that appear to be IMAGE clones are given another check, and the Genbank accession used to retrieve the sequence is based on that check, not on the one provided in the BioSequence; if it differs it will be replaced. This happens when the Genbank accession is for a Refseq (for example) but the actual clone on the array is from IMAGE.
- Parameters:
databaseNames - the names of the BLAST-formatted databases to search (e.g., nt, est_mouse)
blastDbHome - where to find the blast databases for sequence retrieval
force - If true, then when an existing BioSequence contains a non-empty sequence value, it will be overwritten with a new one.
arrayDesign - platform
- Returns:
bio sequences
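
The effect of the force parameter described above can be summarized in a small decision function. This is a sketch of the documented rule only, with hypothetical names; in particular, keeping the existing value when nothing was retrieved is an assumption and is not stated in the description.

```java
// Hypothetical helper for illustration only; not part of the Gemma API.
class ForceOverwriteSketch {

    // Decides which sequence value to keep for a BioSequence.
    static String resolve(String existing, String retrieved, boolean force) {
        // Assumption: if nothing was retrieved, the existing value is left alone.
        if (retrieved == null || retrieved.isEmpty()) {
            return existing;
        }
        boolean hasExisting = existing != null && !existing.isEmpty();
        // A non-empty existing value is only replaced when force is true.
        return (hasExisting && !force) ? existing : retrieved;
    }
}
```

An empty BioSequence is always filled in; a populated one changes only when force is true.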
-
processArrayDesign
Collection<BioSequence> processArrayDesign(ArrayDesign arrayDesign, String[] databaseNames, String blastDbHome, boolean force, FastaCmd fc)
Provided primarily for testing.
- Parameters:
databaseNames - the names of the BLAST-formatted databases to search (e.g., nt, est_mouse)
blastDbHome - where to find the blast databases for sequence retrieval
force - If true, then when an existing BioSequence contains a non-empty sequence value, it will be overwritten with a new one.
arrayDesign - platform
fc - fasta command
- Returns:
bio sequences
-
processSingleAccession
BioSequence processSingleAccession(String sequenceId, String[] databaseNames, String blastDbHome, boolean force)
Update a single sequence in the system.
- Parameters:
force - If true, and an existing BioSequence that matches is found in the system, any existing sequence information in the BioSequence will be overwritten.
databaseNames - database names
blastDbHome - blast db home
sequenceId - sequence id
- Returns:
persistent BioSequence.
-
validateTaxon
Taxon validateTaxon(Taxon taxon, ArrayDesign arrayDesign) throws IllegalArgumentException
- Throws:
IllegalArgumentException