LINCS Workflow

Find attributes about genes and proteins for machine learning

DCIC processed datasets ready for machine learning

The BD2K-LINCS DCIC created a resource that contains processed datasets ready for machine learning, so that researchers can learn new knowledge about genes and proteins. This resource is called the Harmonizome. The Harmonizome datasets are organized as large feature tables, where the genes are the rows and the attributes are the columns. Every attribute (column label) is associated with a gene set (rows in the column). For example, a machine learning expert can select any gene set from any dataset to serve as classification labels, and then train a classifier on the remaining datasets to predict those gene labels.
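
The labeling idea described above can be sketched in a few lines of Python. This is an illustration only: the gene names, attribute names, and values are made up, the table is far smaller than any real dataset, and a simple nearest-centroid rule stands in for whatever classifier an expert would actually choose.

```python
# Toy feature table: genes are rows, attributes are columns.
# A value of 1 means the gene belongs to that attribute's gene set.
# All names and values below are hypothetical, not Harmonizome data.
ATTRIBUTES = ["breast cancer", "kinase", "nucleus", "membrane"]
TABLE = {
    "GENE_A": [1, 1, 0, 1],
    "GENE_B": [1, 1, 0, 1],
    "GENE_C": [0, 0, 1, 0],
    "GENE_D": [0, 0, 1, 0],
    "GENE_E": [1, 1, 1, 1],
    "GENE_F": [0, 0, 1, 1],
}


def split_labels(table, attributes, label_attribute):
    """Use one attribute column as labels and the other columns as features."""
    idx = attributes.index(label_attribute)
    features = {gene: row[:idx] + row[idx + 1:] for gene, row in table.items()}
    labels = {gene: row[idx] for gene, row in table.items()}
    return features, labels


def train(features, labels, genes):
    """Compute one centroid (mean feature vector) per class."""
    by_class = {}
    for gene in genes:
        by_class.setdefault(labels[gene], []).append(features[gene])
    return {
        cls: [sum(col) / len(vectors) for col in zip(*vectors)]
        for cls, vectors in by_class.items()
    }


def predict(model, vector):
    """Return the class whose centroid is closest to the feature vector."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(model, key=lambda cls: squared_distance(model[cls], vector))


features, labels = split_labels(TABLE, ATTRIBUTES, "breast cancer")
model = train(features, labels, ["GENE_A", "GENE_B", "GENE_C", "GENE_D"])
predictions = {gene: predict(model, features[gene]) for gene in ["GENE_E", "GENE_F"]}
```

In practice the labels and the features would come from different datasets, and the held-out genes would be used to check how well the predicted labels match the known gene set.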

Machine Learning with LINCS

The Harmonizome has over one hundred preprocessed datasets ready for machine learning. These datasets are free and available for download.

For more information on the datasets that are available, and how these were processed, you can watch the three lectures the DCIC prepared for their course on Coursera:

The Harmonizome Concept
The Harmonizome Datasets, Part I
The Harmonizome Datasets, Part II

Workflows

To see all available datasets
  1. Visit the Harmonizome downloads page
  2. Filter the tables using the search box in the top right corner
  3. Sort the tables by clicking on the column labels
  4. Select a dataset by clicking on a dataset name in the second column
To find relevant gene sets or attributes based on plaintext queries
  1. Visit the Harmonizome home page
  2. Type keyword(s) into the search bar, for example "breast cancer"
  3. Restrict the results to gene sets by clicking the pink button that says "Gene Set" at the top of the page. Search results will look like this: http://amp.pharm.mssm.edu/Harmonizome/search?q=breast%20cancer&t=geneSet
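
The search URL shown in the last step can also be assembled programmatically. The sketch below simply reproduces the pattern of that example URL. The function name is hypothetical, and "geneSet" is the only value of the "t" parameter that appears on this page, so other values are not assumed.

```python
from urllib.parse import quote, urlencode

# Base address taken from the example search URL on this page.
SEARCH_BASE = "http://amp.pharm.mssm.edu/Harmonizome/search"


def gene_set_search_url(keywords):
    """Build a gene-set search URL for the given keyword string."""
    # quote_via=quote encodes spaces as %20, matching the example URL.
    query = urlencode({"q": keywords, "t": "geneSet"}, quote_via=quote)
    return SEARCH_BASE + "?" + query


url = gene_set_search_url("breast cancer")
```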
To use the processed data based on your search terms
  1. Suppose, for example, that you selected the following gene set: http://amp.pharm.mssm.edu/Harmonizome/gene_set/breast+cancer/DISEASES+Curated+Gene-Disease+Assocation+Evidence+Scores
  2. To download all the data, first visit the associated dataset page, in this case: http://amp.pharm.mssm.edu/Harmonizome/dataset/DISEASES+Curated+Gene-Disease+Assocation+Evidence+Scores
  3. Click on one of the links in the "Downloads" section, for example: http://amp.pharm.mssm.edu/static/hdfs/harmonizome/data/jensendiseasecurated/gene_attribute_matrix.txt.gz
  4. Decompress the downloaded .gz file, then open it in a text or spreadsheet editor
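
As an alternative to opening the file by hand, the first few rows of a downloaded file can be inspected from Python. This is a minimal sketch that assumes the file is gzip-compressed, tab-delimited text. The exact header layout of the real files is not described here, so the function only returns raw rows for inspection. The demo file it reads is a tiny made-up stand-in, not real Harmonizome data.

```python
import csv
import gzip
import os
import tempfile


def preview_matrix(path, n_rows=5):
    """Return the first n_rows rows of a gzipped, tab-delimited file."""
    rows = []
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            rows.append(row)
            if len(rows) >= n_rows:
                break
    return rows


# Stand-in for a downloaded file: a small hypothetical matrix written to a
# temporary directory so the example runs without network access.
demo_path = os.path.join(tempfile.mkdtemp(), "gene_attribute_matrix.txt.gz")
with gzip.open(demo_path, "wt", encoding="utf-8", newline="") as handle:
    handle.write("gene\tattribute_1\tattribute_2\n")
    handle.write("GENE_A\t1\t0\n")
    handle.write("GENE_B\t0\t1\n")

preview = preview_matrix(demo_path, n_rows=2)
```

To inspect a real download, pass the path of the saved gene_attribute_matrix.txt.gz file to preview_matrix and check the header rows before loading the full table.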