Class DatasetCombiner


  • public class DatasetCombiner
    extends Object
    Class to handle cases where there are multiple GEO datasets for a single actual experiment. This can occur in at least two ways:
    1. There is a single GSE (e.g., GSE674) but two datasets (GDS472, GDS473). This can happen when two different microarrays are used, such as the "A" and "B" HG-U133 Affymetrix arrays. (Each GDS can only refer to a single platform.)
    2. Rarely, there can be two series, as well as two data sets, for the situation described above. These cases are 'pathological' (due to incorrect data entry by a user, back in the day), and the GEO folks should be removing them eventually.

    One major problem is figuring out which samples (GSMs) correspond across the datasets. In the example of GSE674, there are samples like C6-U133A (in GDS472) and C6-133B (in GDS473), which apparently, but not "officially", correspond to the same biological RNA. The difficulty is that there is no fail-proof way to determine which samples match up. We do the best we can by using the edit distance between the sample names. Ties can be a problem, but for now the samples are sorted and the first best match is the one kept, on the assumption that corresponding samples will have lower numbers (that is, sample 12929 will match with 12945, not 12955, if the edit distance among the choices is the same).
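
    The matching idea described above can be sketched as follows. This is only an illustration, not the class's actual implementation: the class name SampleMatchSketch and its methods are hypothetical, it uses a plain Levenshtein distance, and it sorts the identifiers lexicographically as a stand-in for whatever ordering the real code applies.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical sketch; not part of DatasetCombiner.
final class SampleMatchSketch {

    /** Levenshtein distance: insertions, deletions and substitutions each cost 1. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /**
     * Greedily pairs each identifier in 'first' with the closest unused identifier
     * in 'second'. Both sides are visited in sorted order, so on a tie the
     * candidate that sorts first is kept.
     */
    static Map<String, String> match(List<String> first, List<String> second) {
        List<String> remaining = new ArrayList<>(new TreeSet<>(second));
        Map<String, String> result = new LinkedHashMap<>();
        for (String name : new TreeSet<>(first)) {
            String best = null;
            int bestDistance = Integer.MAX_VALUE;
            for (String candidate : remaining) {
                int d = editDistance(name, candidate);
                if (d < bestDistance) { // strict '<' keeps the first best match on ties
                    bestDistance = d;
                    best = candidate;
                }
            }
            if (best != null) {
                result.put(name, best);
                remaining.remove(best); // each candidate is used at most once
            }
        }
        return result;
    }
}
```

    A greedy pass like this is simple but order-dependent: an early match can take a candidate that a later sample would have fit better, which is one reason the correspondence should be treated as a best guess.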

    Another problem is that there is no way to go from GDS-->GSE-->other GDS without scraping the GEO web site.
    Author:
    pavlidis
    • Constructor Detail

      • DatasetCombiner

        public DatasetCombiner()
      • DatasetCombiner

        public DatasetCombiner(boolean doSampleMatching)
    • Method Detail

      • findGDSforGSE

        public static Collection<String> findGDSforGSE(Collection<String> seriesAccessions)
        Given GEO series ids, find all associated data sets.
        Parameters:
        seriesAccessions - GEO series (GSE) accessions
        Returns:
        a collection of associated GDS accessions. If no GDS is found, the collection will be empty.
      • findGDSforGSE

        public static Collection<String> findGDSforGSE(String seriesAccession)
        Given a single GEO series id, find all associated data sets.
        Parameters:
        seriesAccession - series accession
        Returns:
        GDSs that correspond to the given series. It will be empty if there is no GDS matching.
      • findGSEforGDS

        public static Collection<String> findGSEforGDS(String datasetAccession)
        Given a GDS, find the corresponding GSEs (there can be more than one in rare cases).
        Parameters:
        datasetAccession - dataset accession
        Returns:
        Collection of series this data set is derived from (this is almost always just a single item).
      • findGDSforGDS

        public Collection<String> findGDSforGDS(String datasetAccession)
        Given a GEO dataset id, find all GDS ids that are associated with it.
        Parameters:
        datasetAccession - the GEO dataset (GDS) accession
        Returns:
        all GDS associated with the given accession
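
        Conceptually, this lookup follows the GDS-->GSE-->other GDS route mentioned in the class description. The sketch below illustrates that traversal under stated assumptions: the class RelatedDatasetSketch and the two in-memory maps are hypothetical stand-ins for the lookups that the real class performs against the GEO web site, and the sketch keeps the query accession in the result, which may or may not match the actual method.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical sketch; not part of DatasetCombiner.
final class RelatedDatasetSketch {

    /**
     * Walks GDS -> GSE -> GDS using two lookup tables.
     *
     * @param gds      the dataset accession to start from
     * @param gdsToGse assumed table: dataset accession -> series accessions
     * @param gseToGds assumed table: series accession -> dataset accessions
     * @return sorted set of every dataset sharing a series with the query
     */
    static Collection<String> relatedDatasets(String gds,
            Map<String, List<String>> gdsToGse,
            Map<String, List<String>> gseToGds) {
        Collection<String> result = new TreeSet<>();
        for (String gse : gdsToGse.getOrDefault(gds, Collections.<String>emptyList())) {
            // Collect every dataset derived from this series.
            result.addAll(gseToGds.getOrDefault(gse, Collections.<String>emptyList()));
        }
        return result;
    }
}
```

        With the GSE674 example from the class description, starting from GDS472 would reach GSE674 and then both GDS472 and GDS473.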
      • findGSECorrespondence

        public GeoSampleCorrespondence findGSECorrespondence(Collection<GeoDataset> dataSets)
        Try to line up samples across datasets.
        Parameters:
        dataSets - datasets
        Returns:
        sample correspondence
      • findGSECorrespondence

        public GeoSampleCorrespondence findGSECorrespondence(GeoSeries series)
        Try to line up samples across datasets contained in a series.
        Parameters:
        series - geo series
        Returns:
        geo sample correspondence