STAR

Redefining the meaning of disease... Together!

Stargeo API

Stargeo provides an API to access our data, create annotations, run analyses, and more. It is presented here as a set of copy-pastable examples. Continue reading or jump to series and samples, platforms and probes, tags, analyses, or annotations.

We start by importing requests and pandas.

In [3]:
import requests
import pandas as pd

Series and their samples

In [9]:
# Fetch the first 10 series (the limit defaults to 100)
r = requests.get('http://stargeo.org/api/v2/series/?limit=10')
assert r.ok
data = r.json()

data['count'], len(data['results'])
Out[9]:
(33785, 10)
In [13]:
data['results'][0]
Out[13]:
{u'attrs': '...',
 u'gse_name': u'GSE1',
 u'platforms': [u'GPL7'],
 u'specie': u'human'}
In [ ]:
# Fetch next 10 series
r = requests.get(data['next'])
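
Following the `next` link by hand gets tedious for a listing of this size, so the loop can be wrapped in a small generator. This is only a sketch, not part of the Stargeo API: `fetch_json` and `iter_results` are hypothetical helper names, and it assumes the last page reports a missing or null `next`, which the examples here do not show.

```python
def fetch_json(url):
    """GET a URL with requests and return the decoded JSON body."""
    import requests  # imported lazily so iter_results can be tried with a stub fetcher
    r = requests.get(url)
    r.raise_for_status()
    return r.json()


def iter_results(url, get_json=fetch_json):
    """Yield every item from a paginated listing, following 'next' links.

    Assumption: the final page has a missing or null 'next' field.
    """
    while url:
        page = get_json(url)
        for item in page['results']:
            yield item
        url = page.get('next')
```

For example, `for series in iter_results('http://stargeo.org/api/v2/series/?limit=100'): ...` would walk the whole listing, issuing one request per page.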

You can also fetch a single series and its samples by GSE name.

In [16]:
# Fetch GSE1 series data
requests.get('http://stargeo.org/api/v2/series/GSE1/').text
Out[16]:
u'{"platforms":["GPL7"],"attrs":{"status":"Public on Jan 22 2001","contact_address":" ","relation":"BioProject: http://www.ncbi.nlm.nih.gov/bioproject/PRJNA84463","sample_id":"GSM11 GSM12 GSM13 GSM14 GSM15 GSM16 GSM17 GSM18 GSM19 GSM20 GSM21 GSM22 GSM23 GSM24 GSM25 GSM26 GSM27 GSM28 GSM29 GSM30 GSM31 GSM32 GSM33 GSM34 GSM35 GSM36 GSM37 GSM38 GSM39 GSM40 GSM41 GSM42 GSM43 GSM44 GSM45 GSM46 GSM47 GSM48 ","contact_name":"Michael,,Bittner","contact_country":"USA","title":"NHGRI_Melanoma_class","contact_institute":"NHGRI, NIH","sample_taxid":"9606","pubmed_id":"10952317","type":"Expression profiling by array","submission_date":"Jan 22 2001","contact_state":"MD","contact_zip_postal_code":"20892","geo_accession":"GSE1","contact_email":"mbittner@nhgri.nih.gov","last_update_date":"Jul 18 2016","contact_web_link":"http://www.nhgri.nih.gov/Intramural_research/People/bittnerm.html","contact_city":"Bethesda","contact_phone":"301-496-7980","summary":"This series represents a group of cutaneous malignant melanomas and unrelated controls which were clustered based on correlation coefficients calculated through a comparison of gene expression|\\n|profiles.|\\n|Keywords: other","platform_id":"GPL7","contact_department":"Cancer Genetics Branch","contact_fax":"301-402-3241","platform_taxid":"9606"},"specie":"human","gse_name":"GSE1"}'
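
The `.text` above is raw JSON; calling `.json()` instead gives a dict whose `attrs` field holds the GEO metadata. As the output shows, `attrs['sample_id']` is a single space-separated string, so a small helper can turn it into a list. `series_sample_ids` is a hypothetical name used for illustration.

```python
def series_sample_ids(series):
    """Return the GSM names listed in a series record's attrs['sample_id'] string."""
    raw = series.get('attrs', {}).get('sample_id', '')
    # str.split() with no argument also drops the trailing space seen in the output
    return raw.split()
```

This applies to the single-series record only; in the listing example the `attrs` value is elided.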
In [19]:
# Fetch GSE1 samples
samples_json = requests.get('http://stargeo.org/api/v2/series/GSE1/samples/').json()
# or 
samples = pd.read_json('http://stargeo.org/api/v2/series/GSE1/samples/')
samples.head()
Out[19]:
attrs gpl_name gse_name gsm_name
0 {u'submission_date': u'Jan 08 2001', u'contact... GPL7 GSE1 GSM20
1 {u'submission_date': u'Jan 08 2001', u'contact... GPL7 GSE1 GSM15
2 {u'submission_date': u'Jan 08 2001', u'contact... GPL7 GSE1 GSM12
3 {u'submission_date': u'Jan 08 2001', u'contact... GPL7 GSE1 GSM18
4 {u'submission_date': u'Jan 08 2001', u'contact... GPL7 GSE1 GSM19
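
Each row keeps its GEO metadata as a nested dict in `attrs`. To filter on those fields it can help to lift them into top-level keys. A minimal sketch, assuming the samples endpoint returns a JSON list of records shaped like the rows above; `flatten_samples` is a hypothetical helper.

```python
def flatten_samples(samples_json):
    """Merge each sample's nested 'attrs' dict into the sample record itself."""
    rows = []
    for sample in samples_json:
        row = {k: v for k, v in sample.items() if k != 'attrs'}
        # Top-level fields (gsm_name, gse_name, gpl_name) win on a name clash
        for key, value in (sample.get('attrs') or {}).items():
            row.setdefault(key, value)
        rows.append(row)
    return rows
```

`pd.DataFrame(flatten_samples(samples_json))` then gives one column per attribute.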

Platforms

In [24]:
# Fetch the first 100 platforms; fetch the rest the same way as with series above
r = requests.get('http://stargeo.org/api/v2/platforms/').json()
platforms = r['results']
platforms[:2]
Out[24]:
[{u'gpl_name': u'GPL7',
  u'probes_matched': 6334,
  u'probes_total': 8192,
  u'specie': u'human'},
 {u'gpl_name': u'GPL96',
  u'probes_matched': 20883,
  u'probes_total': 22283,
  u'specie': u'human'}]
In [26]:
# Fetch single platform by gpl name
requests.get('http://stargeo.org/api/v2/platforms/GPL7/').json()
Out[26]:
{u'gpl_name': u'GPL7',
 u'probes_matched': 6334,
 u'probes_total': 8192,
 u'specie': u'human'}
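
The two probe counts give a quick measure of how much of a platform is usable. Reading `probes_matched` as the number of probes mapped to a gene is our interpretation (it agrees with the 6334 rows returned by the GPL7 probes example), and `match_ratio` is a hypothetical helper.

```python
def match_ratio(platform):
    """Fraction of a platform's probes that are matched (0.0 if it has no probes)."""
    total = platform['probes_total']
    return platform['probes_matched'] / float(total) if total else 0.0
```

For GPL7 above this is 6334 / 8192, roughly 77%.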

Platform probes

In [30]:
probes = pd.read_json('http://stargeo.org/api/v2/platforms/GPL7/probes/', orient='split')
len(probes)
Out[30]:
6334
In [31]:
probes.head()
Out[31]:
probe mygene_sym mygene_entrez
0 5988 ANKRD55 79722
1 5989 DOLPP1 57171
2 5980 SNCA 6622
3 5981 VWA8 23078
4 5986 SCN8A 6334
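
`orient='split'` tells pandas that the JSON is an object with `columns`, `data` and optionally `index` keys, rather than a list of records. If you want to work with such a response without pandas, the same layout can be unpacked by hand. A sketch under the assumption that the endpoint uses exactly that layout; `split_to_records` is a hypothetical helper.

```python
def split_to_records(payload):
    """Convert pandas 'split'-oriented JSON into a list of row dicts."""
    columns = payload['columns']
    # Each entry of 'data' is one row, with values in column order
    return [dict(zip(columns, row)) for row in payload['data']]
```

The row index, when present, is ignored here; keep `payload['index']` alongside if you need it.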

Tags

In [7]:
# Fetch all tags
tags = requests.get('http://stargeo.org/api/v2/tags/').json()
# or
tags = pd.read_json('http://stargeo.org/api/v2/tags/')
tags[tags.concept_name != ''].head()
Out[7]:
concept_full_id concept_name description id ontology_id tag_name
27 http://purl.obolibrary.org/obo/DOID_12206 dengue hemorrhagic fever Dengue hemorrhagic fever (DOID:12206) 7 DOID DHF
28 http://purl.obolibrary.org/obo/DOID_9119 acute myeloid leukemia acute myeloid leukemia (DOID:9119) 117 DOID AML_Tissue
29 http://purl.obolibrary.org/obo/DOID_9206 Barrett's esophagus Barrett's esophagus (DOID:9206) 77 DOID BE_Tissue
30 http://purl.obolibrary.org/obo/DOID_10608 celiac disease celiac disease (DOID:10608) control 180 DOID celiac_control
31 http://purl.obolibrary.org/obo/DOID_12140 Chagas disease Control for Chagas 89 DOID Chagas_control

Fetch single tag info:

In [8]:
# Fetch data for the tag with id 7
requests.get('http://stargeo.org/api/v2/tags/7/').json()
Out[8]:
{u'concept_full_id': u'http://purl.obolibrary.org/obo/DOID_12206',
 u'concept_name': u'dengue hemorrhagic fever',
 u'description': u'Dengue hemorrhagic fever (DOID:12206)',
 u'id': 7,
 u'ontology_id': u'DOID',
 u'tag_name': u'DHF'}
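
Tag records carry both a numeric `id` and a `tag_name`, so a lookup by name is handy when you know the name but need the id for a URL. This sketch assumes the tags endpoint returns a plain list of records like the one above; `tags_by_name` is a hypothetical helper.

```python
def tags_by_name(tags):
    """Index a list of tag records by their tag_name."""
    return {tag['tag_name']: tag for tag in tags}
```

With the list fetched above, `tags_by_name(tags)['DHF']['id']` would give the id to use in `/api/v2/tags/<id>/`.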

Analyses

The Stargeo API lets you list and load existing analyses, along with their results, source data, and fold changes. Additionally, an authorized user can run new analyses.

In [14]:
data = requests.get('http://stargeo.org/api/v2/analysis/').json()
data['count'], len(data['results'])
Out[14]:
(136, 100)
In [15]:
analysis = data['results'][0]
# or
analysis = requests.get('http://stargeo.org/api/v2/analysis/243/').json()
analysis
Out[15]:
{u'analysis_name': u'hypertension',
 u'case_query': u"PHT == 'PHT' or hypertension == 'hypertension'",
 u'control_query': u"PHT_Control == 'PHT_Control' or hypertension_control == 'hypertension_control'",
 u'description': u'hypertension (DOID:10763)',
 u'df': u'http://analysis-df.stargeo.io.s3.amazonaws.com/243-hypertension',
 u'fold_changes': u'http://fold-changes.stargeo.io.s3.amazonaws.com/243-hypertension',
 u'id': 243,
 u'min_samples': 3,
 u'modifier_query': u'',
 u'platform_count': 6,
 u'sample_count': 309,
 u'series_count': 7,
 u'specie': u'',
 u'success': True}

The source and fold-change dataframes are accessible via the links in the df and fold_changes fields of an analysis.

In [17]:
# Fetch source dataframe
analysis_df = pd.read_json(analysis['df'], orient='split')
analysis_df.head()
Out[17]:
series_id platform_id sample_id gsm_name gse_name gpl_name pht hypertension_control hypertension pht_control sample_class
0 202 4 8089 GSM271847 GSE10767 GPL570 pht_control 0
1 202 4 8090 GSM271848 GSE10767 GPL570 pht_control 0
2 202 4 8091 GSM271849 GSE10767 GPL570 pht_control 0
3 202 4 8092 GSM271865 GSE10767 GPL570 pht 1
4 202 4 8093 GSM271866 GSE10767 GPL570 pht 1
In [35]:
# Fetch fold changes. WARNING: this could be big
r = requests.get(analysis['fold_changes'])

# The response body is zlib-compressed, so decompress it before parsing
import zlib
fold_changes = pd.read_json(zlib.decompress(r.content), orient='split')
fold_changes.head()
Out[35]:
probe dataMu dataSigma dataCount caseDataMu caseDataSigma caseDataCount controlDataMu controlDataSigma controlDataCount ... log2foldChange effect_size ttest p direction subset gpl gse mygene_entrez mygene_sym
0 1007_s_at 7.121507 0.542821 7 7.463592 0.464840 4 6.665394 0.117251 3 ... 0.798197 1.470462 2.842841 0.036126 up NA GPL570 GSE10767 780 DDR1
1 1053_at 9.099162 0.378195 7 9.161432 0.426845 4 9.016136 0.371085 3 ... 0.145296 0.384184 0.469187 0.658683 up NA GPL570 GSE10767 5982 RFC2
2 117_at 4.888623 0.503787 7 4.862397 0.707167 4 4.923591 0.089813 3 ... -0.061194 -0.121468 -0.145489 0.890008 down NA GPL570 GSE10767 3310 HSPA6
3 121_at 7.083546 0.281271 7 7.243619 0.269283 4 6.870115 0.094835 3 ... 0.373504 1.327916 2.253205 0.073979 up NA GPL570 GSE10767 7849 PAX8
4 1255_g_at 2.841085 0.956595 7 2.736909 0.767581 4 2.979985 1.345662 3 ... -0.243075 -0.254105 -0.306554 0.771538 down NA GPL570 GSE10767 2978 GUCA1A

5 rows × 21 columns
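
The decompress-then-parse step can be wrapped so the zlib detail lives in one place. A sketch only: `decode_fold_changes` is a hypothetical name, and it assumes the payload is zlib-compressed, UTF-8 encoded JSON, as the example above suggests.

```python
import json
import zlib


def decode_fold_changes(raw_bytes):
    """Decompress a zlib-compressed JSON payload and parse it into Python objects."""
    # Assumption: the decompressed bytes are UTF-8 encoded JSON text
    return json.loads(zlib.decompress(raw_bytes).decode('utf-8'))
```

`decode_fold_changes(r.content)` would then return the parsed JSON object, which can be inspected before handing it to pandas.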

In [36]:
results = pd.read_json('http://stargeo.org/api/v2/analysis/243/results/', orient='split')
results.head()
Out[36]:
mygene_entrez direction k casedatacount controldatacount random_pval random_te random_se random_lower random_upper ... tau2_se c h h_lower h_upper i2 i2_lower i2_upper q q_df
BLM 641 up 7 203 106 0.004175 0.009608 0.003354 0.003034 0.016181 ... NaN 18535.273509 1.000000 1.000000 1.000000 0.000000 0.000000 0.000000 5.801143 6
A1BG 1 up 5 127 78 0.745816 0.007206 0.022229 -0.036361 0.050773 ... NaN 5559.082818 1.588522 1.000000 2.595664 0.603710 0.000000 0.851576 10.093607 4
A1BG-AS1 503538 up 2 30 16 0.092131 0.031295 0.018581 -0.005123 0.067713 ... NaN 2.505217 NaN NaN NaN NaN NaN NaN 0.562633 1
A1CF 29974 up 5 172 88 0.104480 0.002854 0.001758 -0.000591 0.006299 ... NaN 31493.161994 1.000000 1.000000 1.000000 0.000000 0.000000 0.000000 1.374130 4
A2M 2 down 7 203 106 0.689928 -0.003276 0.008211 -0.019368 0.012817 ... NaN 101435.271415 1.986816 1.362318 2.897590 0.746671 0.461181 0.880896 23.684635 6

5 rows × 34 columns

To create and start an analysis you need to provide an auth token. You can see yours in the example below once you are logged in.

In [52]:
# This is your auth token, don't share it with anybody
headers = {'Authorization': 'Token your-token-here'}
# Create new analysis
r = requests.post('http://stargeo.org/api/v2/analysis/', headers=headers, data={
    'analysis_name': 'Young Severe Dengue',
    'description': 'Dengue cases in patients under 9',
    'specie': 'human',
    'case_query': "DHF=='DHF' or DSS=='DSS'",
    'control_query': "DF=='DF'",
    'modifier_query': "Age < 9",
})
r.json()
Out[52]:
{u'created': 385}
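
The case and control queries in these examples follow one pattern: a tag column compared with its own name, joined with `or`. Assuming that pattern holds for your tags, the string can be generated rather than typed. `tag_query` is a hypothetical helper, not part of the API.

```python
def tag_query(tag_names):
    """Build a query such as "DHF=='DHF' or DSS=='DSS'" from a list of tag names."""
    return ' or '.join("{0}=='{0}'".format(name) for name in tag_names)
```

For instance, `tag_query(['DHF', 'DSS'])` reproduces the `case_query` used above.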

Annotations

The annotations API allows listing, fetching, and adding annotations. At any point in time you see the best available annotations, along with their reliability characteristics. Pay attention to the best_cohens_kappa attribute: we consider an annotation validated when it equals 1, meaning that two annotation authors blindly annotated it the same way.

In [37]:
annotations = requests.get('http://stargeo.org/api/v2/annotations/').json()
annotations['count'], len(annotations['results'])
Out[37]:
(16791, 100)
In [45]:
annotations['results'][0]
Out[45]:
{u'annotations': 2,
 u'authors': 2,
 u'best_cohens_kappa': 1.0,
 u'captive': False,
 u'column': u'sample_source_name_ch1',
 u'fleiss_kappa': 1.0,
 u'gpl_name': u'GPL7',
 u'gse_name': u'GSE1',
 u'id': 1607,
 u'regex': u'melanoma',
 u'samples': 38,
 u'tag_id': 123}
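
Validated annotations can be picked out of a results page by checking `best_cohens_kappa`, following the rule stated at the top of this section. A minimal sketch; `validated` is a hypothetical helper.

```python
def validated(annotations):
    """Keep only annotation records whose best_cohens_kappa equals 1."""
    return [a for a in annotations if a.get('best_cohens_kappa') == 1]
```

Applied to the page fetched above, this would be `validated(annotations['results'])`.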
In [49]:
print(requests.get('http://stargeo.org/api/v2/annotations/1607/samples/').text)
{"GSM48":"melanoma","GSM46":"melanoma","GSM47":"melanoma","GSM44":"melanoma","GSM45":"melanoma","GSM42":"melanoma","GSM43":"melanoma","GSM40":"","GSM41":"melanoma","GSM11":"melanoma","GSM13":"melanoma","GSM12":"","GSM15":"melanoma","GSM14":"melanoma","GSM39":"melanoma","GSM38":"melanoma","GSM37":"melanoma","GSM36":"melanoma","GSM35":"melanoma","GSM34":"melanoma","GSM33":"melanoma","GSM32":"melanoma","GSM31":"melanoma","GSM30":"melanoma","GSM17":"melanoma","GSM16":"melanoma","GSM19":"","GSM18":"","GSM28":"melanoma","GSM29":"melanoma","GSM20":"","GSM21":"","GSM22":"","GSM23":"melanoma","GSM24":"melanoma","GSM25":"melanoma","GSM26":"melanoma","GSM27":"melanoma"}
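
The samples endpoint returns a mapping from GSM name to the annotated value, with an empty string apparently standing for samples that did not receive the tag. Counting the values gives a quick sanity check against the `samples` field of the annotation. `annotation_counts` is a hypothetical helper.

```python
from collections import Counter


def annotation_counts(sample_values):
    """Count how many samples carry each annotation value ('' means no value)."""
    return Counter(sample_values.values())
```

The total, `sum(annotation_counts(...).values())`, should match the number of samples in the mapping.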

To post annotations you must be authorized as a competent user. To authorize, send the same Authorization token header as when creating an analysis.

In [61]:
# This is your auth token, don't share it with anybody
headers = {'Authorization': 'Token your-token-here'}
# Create new annotation
r = requests.post('http://stargeo.org/api/v2/annotations/', headers=headers, json={
    'tag': 'melanoma',
    'series': 'GSE1',
    'platform': 'GPL7',
    # Need to provide full set of samples
    'annotations': {'GSM11': 'melanoma', 'GSM12': '', ...},
    # Optional text note
    'note': '...',  
})
assert r.ok
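
Because the full set of samples is required, it is easy to miss one when writing the mapping by hand. One way to build it is to start from the complete sample list of the series and fill in blanks. A sketch under the assumption that untagged samples are sent as empty strings, as in the samples listing for annotation 1607; `full_annotations` is a hypothetical helper.

```python
def full_annotations(all_samples, tagged, value):
    """Map every sample to `value` if it is in `tagged`, else to ''."""
    tagged = set(tagged)
    unknown = tagged - set(all_samples)
    if unknown:
        # Catch typos early instead of silently dropping a sample
        raise ValueError('samples not in series: %s' % sorted(unknown))
    return {gsm: (value if gsm in tagged else '') for gsm in all_samples}
```

The result can be passed as the `'annotations'` value in the request body.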