java.lang.Object
ubic.gemma.core.loader.expression.geo.model.GeoValues
All Implemented Interfaces:
Serializable

public class GeoValues extends Object implements Serializable
Class to store the expression data prior to conversion. The data are read from series files sample by sample, within each sample designElement by designElement, and within each designElement quantitationType by quantitationType. Values are stored in vectors, roughly equivalent to DesignElementDataVectors. This is an important class because it encompasses how we convert GEO sample data into vectors.
The conversion is predicated on two assumptions. First, all samples present their quantitation types in the same order. Second, all samples have the same quantitation types, or at worst some are missing off the 'end' for some samples (in which case the vectors are padded). We do not assume that all samples have quantitation types with the same names (quantitation types correspond to column names in the GEO files).
We have found two counterexamples (so far) that push or violate these assumptions: GSE360 and GSE4345 (which is really broken). Loading GSE4345 results in a cast exception because the quantitation types are 'mixed up' across the samples.
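The padding assumption can be illustrated with a minimal, self-contained sketch. It does not use the Gemma classes: `PaddingSketch` and `buildVectors` are hypothetical names, each sample is reduced to the list of column values for a single design element, and using null as the padding value is an assumption rather than documented behaviour.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PaddingSketch {
    /**
     * Hypothetical helper: turns per-sample rows (one row per sample, one
     * entry per quantitation type column) into one vector per quantitation
     * type. Samples that are missing columns off the end are padded with null.
     */
    static List<List<String>> buildVectors(List<List<String>> samples) {
        int maxColumns = 0;
        for (List<String> sample : samples) {
            maxColumns = Math.max(maxColumns, sample.size());
        }
        List<List<String>> vectors = new ArrayList<>();
        for (int qt = 0; qt < maxColumns; qt++) {
            List<String> vector = new ArrayList<>();
            for (List<String> sample : samples) {
                // Assumption: a missing trailing column is represented by null.
                vector.add(qt < sample.size() ? sample.get(qt) : null);
            }
            vectors.add(vector);
        }
        return vectors;
    }

    public static void main(String[] args) {
        List<List<String>> samples = Arrays.asList(
                Arrays.asList("1.0", "P", "0.01"),
                Arrays.asList("2.0", "A"),          // third column missing
                Arrays.asList("3.0", "P", "0.02"));
        // prints [[1.0, 2.0, 3.0], [P, A, P], [0.01, null, 0.02]]
        System.out.println(buildVectors(samples));
    }
}
```

The sketch relies on position alone, which is why the first assumption (same column order in every sample) matters: if a sample reorders its columns, values of different types end up in the same vector.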
Author:
pavlidis
  • Constructor Details

    • GeoValues

      public GeoValues()
  • Method Details

    • addQuantitationType

      public void addQuantitationType(GeoPlatform platform, String columnName, Integer index)
      Parameters:
      platform - platform
      columnName - column name
      index - the actual index of the data in the final data structure, not necessarily the column where the data are found in the data file (as that can vary from sample to sample).
    • addSample

      public void addSample(GeoSample sample)
      Only call this to add a sample for which there are no data.
      Parameters:
      sample - geo sample
    • addValue

      public void addValue(GeoSample sample, Integer quantitationTypeIndex, String designElement, String value)
      Store a value. Design elements are assumed to have unique names. Implementation note: the first time we see a sample, we associate it with a 'dimension' that is connected to the platform and quantitation type. In parallel, we add the data to a 'vector' for the designElement that is likewise connected to the platform the sample uses and to the quantitation type. Because samples are seen one at a time in GEO files, the vector for each designElement is built up incrementally, so it is important that a value is added for each sample for each design element. Note what happens if data are MISSING for a given designElement/quantitationType/sample combination, which can happen (typically for all the quantitation types of a designElement in a given sample): this method will NOT be called, so when the next sample is processed its data will be added onto the end in the wrong place, and the data in the vectors stored here will be incorrect. The GEO parser therefore has to ensure that each vector is 'completed' before moving to the next sample.
      Parameters:
      sample - sample
      quantitationTypeIndex - The column number for the quantitation type, needed because the names of the quantitation types don't always match across samples (but hopefully the columns do). Even though the first column contains the design element name (ID_REF), the first quantitation type should be numbered 0. This is almost always a good way to match values across samples, but there ARE cases where the order isn't the same for two samples in the same series.
      designElement - design element
      value - The data point to be stored.
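The misalignment problem described above, and the 'completion' step that avoids it, can be sketched as follows. This is a simplified stand-in, not the Gemma implementation: `VectorCompletionSketch`, `completeSample` and the use of null as a filler are assumptions made for illustration, and the sketch tracks a single platform and quantitation type.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class VectorCompletionSketch {
    // One vector per design element, with one slot per sample in arrival order.
    private final Map<String, List<String>> vectors = new LinkedHashMap<>();
    private int samplesCompleted = 0;

    /** Appends a value for the sample currently being read. */
    void addValue(String designElement, String value) {
        List<String> vector = vectors.computeIfAbsent(designElement, k -> new ArrayList<>());
        // A design element first seen in a later sample needs leading filler.
        while (vector.size() < samplesCompleted) {
            vector.add(null);
        }
        vector.add(value);
    }

    /** Hypothetical completion step: fill in vectors the finished sample skipped. */
    void completeSample() {
        samplesCompleted++;
        for (List<String> vector : vectors.values()) {
            while (vector.size() < samplesCompleted) {
                vector.add(null);
            }
        }
    }

    List<String> getValues(String designElement) {
        return vectors.get(designElement);
    }

    public static void main(String[] args) {
        VectorCompletionSketch values = new VectorCompletionSketch();
        values.addValue("probe_1", "1.0");
        values.addValue("probe_2", "5.0");
        values.completeSample();
        values.addValue("probe_1", "1.1"); // probe_2 is missing in the second sample
        values.completeSample();
        values.addValue("probe_1", "1.2");
        values.addValue("probe_2", "5.2");
        values.completeSample();
        // prints [5.0, null, 5.2]
        System.out.println(values.getValues("probe_2"));
    }
}
```

Without the call to `completeSample()` after the second sample, the value "5.2" would land in the second slot of the probe_2 vector and be attributed to the wrong sample.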
    • clear

      public void clear(GeoPlatform geoPlatform)
      Remove the data for a given platform (use this to save memory).
      Parameters:
      geoPlatform - geo platform
    • clear

      public void clear(GeoPlatform platform, List<GeoSample> datasetSamples, Integer quantitationTypeIndex)
      If possible, null out the data for a quantitation type on a given platform.
      Parameters:
      platform - platform
      datasetSamples - dataset samples
      quantitationTypeIndex - QT index
    • getIndices

      @Nullable public Integer[] getIndices(GeoPlatform platform, List<GeoSample> neededSamples, int quantitationType)
      Get the indices of the data for a set of samples - this can be used to get a slice of the data. This is inefficient but shouldn't need to be called all that frequently.
      Parameters:
      platform - platform
      neededSamples - the samples needed; all must be from the same platform. If we don't have data for a given sample, the index returned will be null. This can happen when some samples don't have all the quantitation types (GSE360 for example).
      quantitationType - quantitation type
      Returns:
      integer array
    • getQuantitationTypeIndex

      public Integer getQuantitationTypeIndex(GeoPlatform platform, String columnName)
    • getQuantitationTypes

      public Collection<Integer> getQuantitationTypes(GeoPlatform samplePlatform)
      Parameters:
      samplePlatform - sample platform
      Returns:
      Collection of Integer indices representing the quantitation types for the given platform.
    • getValues

      public List<String> getValues(GeoPlatform platform, Integer quantitationType, String designElement)
    • getValues

      public String[] getValues(GeoPlatform platform, Integer quantitationType, String designElement, Integer[] indices)
      Parameters:
      platform - platform
      quantitationType - QT
      designElement - design element
      indices - indices
      Returns:
      a 'slice' of the data corresponding to the indices provided.
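How an index array with null entries can drive a slice is sketched below. The sketch is self-contained and does not call the Gemma API: `SliceSketch`, `indicesOf` and `slice` are hypothetical names, samples are represented by their accession strings, and returning null for a sample without data is an assumption about how a null index would be handled.

```java
import java.util.Arrays;
import java.util.List;

class SliceSketch {
    /** Looks up each needed sample in the stored order; null marks "no data". */
    static Integer[] indicesOf(List<String> storedSamples, List<String> neededSamples) {
        Integer[] result = new Integer[neededSamples.size()];
        for (int i = 0; i < neededSamples.size(); i++) {
            // Linear search per sample: simple but inefficient for large series.
            int position = storedSamples.indexOf(neededSamples.get(i));
            if (position >= 0) {
                result[i] = position;
            } else {
                result[i] = null;
            }
        }
        return result;
    }

    /** Picks the values at the given indices; a null index yields a null value. */
    static String[] slice(List<String> vector, Integer[] indices) {
        String[] result = new String[indices.length];
        for (int i = 0; i < indices.length; i++) {
            result[i] = indices[i] == null ? null : vector.get(indices[i]);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> stored = Arrays.asList("GSM1", "GSM2", "GSM3");
        List<String> vector = Arrays.asList("1.0", "2.0", "3.0");
        Integer[] indices = indicesOf(stored, Arrays.asList("GSM3", "GSM9", "GSM1"));
        System.out.println(Arrays.toString(indices));                // [2, null, 0]
        System.out.println(Arrays.toString(slice(vector, indices))); // [3.0, null, 1.0]
    }
}
```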
    • hasData

      public boolean hasData()
    • isWantedQuantitationType

      public boolean isWantedQuantitationType(String quantitationTypeName)
      Some quantitation types are 'skippable' - they are easily recomputed from other values, or are not necessary in the system. Skipping these makes loading the data more manageable for some data sets that are very large.
      Parameters:
      quantitationTypeName - QT name
      Returns:
      true if the name is NOT on the 'skippable' list.
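A 'skippable' check of this kind can be sketched as a simple set lookup. The names in the set below are placeholders invented for illustration, not the actual skip list used by Gemma, and the exact-match, case-sensitive comparison is likewise an assumption.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class SkippableSketch {
    // Placeholder names for illustration only; NOT the real skip list.
    private static final Set<String> SKIPPABLE = new HashSet<>(
            Arrays.asList("EXAMPLE_BACKGROUND_SD", "EXAMPLE_PIXEL_COUNT"));

    /** Returns true if the name is NOT on the (hypothetical) skippable list. */
    static boolean isWanted(String quantitationTypeName) {
        return !SKIPPABLE.contains(quantitationTypeName);
    }

    public static void main(String[] args) {
        System.out.println(isWanted("VALUE"));               // true
        System.out.println(isWanted("EXAMPLE_PIXEL_COUNT")); // false
    }
}
```

A parser could consult such a check once per column when reading a sample header, so that values in unwanted columns are never stored.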
    • subset

      public GeoValues subset(Collection<GeoSample> samples)
      This creates a new GeoValues that has data only for the selected samples. The quantitation type information will be semi-deep copies. This is only needed when we are splitting a series apart, especially when the split is not along platform lines.
      Parameters:
      samples - samples
      Returns:
      geo values
    • toString

      public String toString()
      Overrides:
      toString in class Object
    • validate

      public void validate()
      This method can only be called once a sample has been completely processed, and before a new sample has been started.