Predicting galaxy redshifts from 3D-HST#

Last update: June 13, 2023

Setup environment#

%%file requirements.txt
matplotlib
pandas
numpy
scikit-learn
tensorflow
torch
jax
jaxlib
flax
astropy
plotly
git+https://github.com/mfouesneau/ezdata
Overwriting requirements.txt
!python3 -m pip install -q -r requirements.txt

Notebook setup#

%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import Markdown, display
from ezdata.matplotlib import light_minimal
plt.style.use(light_minimal)

Predicting galaxy redshift via regression on 3D-HST photometry#

Download data#

First, download the 3D-HST catalog from the MAST archive. This dataset is described in Skelton et al. (2014).

from astropy.utils.data import download_file
import tarfile
import os


if not os.path.isdir('3dhst_master.phot.v4.1'):
    file_url = 'https://archive.stsci.edu/missions/hlsp/3d-hst/RELEASE_V4.0/Photometry/3dhst_master.phot.v4.1.tar'
    tarfile.open(download_file(file_url, cache=True), "r:")\
           .extract('3dhst_master.phot.v4.1/3dhst_master.phot.v4.1.cat', '.')
else:
    print('file already downloaded.')
file already downloaded.

Read the combined photometric catalog into a pandas dataframe

import pandas as pd
fname = '3dhst_master.phot.v4.1/3dhst_master.phot.v4.1.cat'
with open(fname, 'r') as fin:
    header = fin.readline()[1:]  #remove comment character
    df = pd.read_csv(fin, names=header.strip().split(), comment='#', delimiter=r'\s+')
df = df.set_index('id')
df
[output: the catalog as a DataFrame — 207967 rows × 37 columns (field, coordinates, fluxes and uncertainties, flags, exposure counts, lmass, Av), indexed by id]

Exploring the dataset#

Now we will examine the contents of the catalog.

The dataset contains standard information such as target id, field name, coordinates (ra, dec), fluxes, uncertainties, and various photometric flags (see Skelton et al. 2014, or download the README here).

In addition, there are some derived properties such as photometric redshift (z_peak), spectroscopic redshift (z_spec), mass (lmass) and dust extinction in the V band (Av).

Kriek et al. (2009; FAST) describe the mass estimates.

Keep in mind that we are interested in predicting redshift, so our “target” variable will be z_spec. The “features” will be all relevant columns.

df.columns
Index(['field', 'ra', 'dec', 'x', 'y', 'z_spec', 'z_peak', 'faper_F140W',
       'eaper_F140W', 'faper_F160W', 'eaper_F160W', 'f_F606W', 'e_F606W',
       'f_F814W', 'e_F814W', 'f_F125W', 'e_F125W', 'f_F140W', 'e_F140W',
       'f_F160W', 'e_F160W', 'tot_cor', 'kron_radius', 'a_image', 'b_image',
       'flux_radius', 'fwhm_image', 'flags', 'f140w_flag', 'star_flag',
       'use_phot', 'near_star', 'nexp_f125w', 'nexp_f140w', 'nexp_f160w',
       'lmass', 'Av'],
      dtype='object')

Cleaning the dataset#

For our application, some columns should not be used when modeling redshift. Consider the following:

  • we drop the identifier, the field, x, y, ra, and dec: we do not want the model to learn spatial anisotropies tied to the limited survey footprint.

  • we consider all fluxes, but note that the faper_ fields largely duplicate the total flux measurements, so we drop them.

  • nexp_ gives us the number of exposures, which does not seem important here.

  • tot_cor corresponds to a flux (aperture) correction.

  • kron_radius is the Kron radius: an aperture of twice this radius contains about 90% of the source flux. flux_radius similarly corresponds to the half-light radius. Let's remove both.

  • a_image & b_image correspond to semi-major and semi-minor axes

  • fwhm_image corresponds to the Gaussian fit of the objects’ cores. Not relevant for us.

  • star_flag is 0 for galaxies; any other value indicates a star.

  • lmass corresponds to the estimated stellar mass of the galaxy, derived from the redshift and luminosity. We must remove it to avoid data leakage.

  • Av corresponds to the extinction along the line of sight. This may be worth keeping, as it traces the reddening and dimming of a source.

  • use_phot & near_star are quality flags about the detection and measurements. They could help select the cleanest possible training set.

df.describe()
[output: df.describe() summary, 8 rows × 36 columns — the minima reveal sentinel values, e.g. z_spec down to -99.9, fluxes down to -99.0, lmass and Av down to -1.0]

Looking at the summary, we see that some z_spec, flux, and Av values are negative, and some fluxes are set to -99.0, a sentinel marking missing measurements.
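Before filtering, it helps to quantify how many entries are affected; a minimal sketch counting sentinel and negative values per column:

flux_cols = [name for name in df.columns if name.startswith('f_F')]
print((df[flux_cols] == -99.0).sum())  # sentinel value marking missing fluxes
print((df[flux_cols] < 0).sum())       # all non-physical (negative) fluxes
print((df.z_spec < 0).sum(), 'sources without a spectroscopic redshift')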

remove = ['field', 'ra', 'dec', 'x', 'y',
          'faper_F140W', 'eaper_F140W', 'faper_F160W', 'eaper_F160W',
          'tot_cor', 'kron_radius', 'a_image', 'b_image',
          'flux_radius', 'fwhm_image', 'flags', 'f140w_flag', 'star_flag',
          'use_phot', 'near_star', 'nexp_f125w', 'nexp_f140w', 'nexp_f160w',
          'lmass']

columns = [k for k in df.columns if k not in remove]

wrap = 6

plt.figure(figsize=(15, 10))

nlines = len(columns) // wrap +  int(len(columns) % wrap > 0) 

for e, colname in enumerate(columns, 1):
    plt.subplot(nlines, wrap, e)
    plt.hist(df[colname], bins=43, log=True)
    plt.xlabel(colname)

plt.tight_layout()
[figure: log-scale histograms of each retained catalog column]
select = (df.z_peak > 0) & (df.z_spec > 0)
plt.plot(df[select].z_peak, df[select].z_spec, '.', rasterized=True)
plt.xlabel('z phot')
plt.ylabel('z spec')
[figure: z_spec vs. z_peak for sources with both measurements]
flux_cols = [name for name in df.columns if name.startswith('f_F')]
positive_flux = ' & '.join([f'({col:s} > 0)' for col in flux_cols])
x_ = np.array([606, 814, 1250, 1400, 1600])

select_zphot = '(star_flag == 0) & (use_phot == 1) & (z_peak > 0)'
select_zspec = '(star_flag == 0) & (z_spec > 0)'

cmap = plt.get_cmap('jet')
norm = plt.Normalize(vmin=0, vmax=2)


plt.figure(figsize=(10, 5))

ax = plt.subplot(121)
select = df.eval(select_zphot + '&' + positive_flux)
df_ = df[select]
ind = np.random.choice(len(df_), 2_000, replace=False)
X = df_[flux_cols].to_numpy()[ind]
y = df_['z_peak'].to_numpy()[ind]

for xk, yk in zip(X, y):
    color = cmap(norm(yk))
    plt.plot(x_, x_ * xk, color=color, rasterized=True, alpha=0.4, lw=0.5)

plt.yscale('log')
sm = plt.cm.ScalarMappable(cmap=cmap, norm=norm)
sm.set_array([])
cbar = plt.colorbar(sm, ax=plt.gca())
cbar.set_label('z phot')
cbar.ax.set_visible(False)
plt.ylabel(r'$\lambda \cdot$ flux')
plt.xlabel(r'$\lambda$ [nm]')
plt.title('z phot')


plt.subplot(122,  sharex=ax, sharey=ax)
select = df.eval(select_zspec + '&' + positive_flux)
df_ = df[select]
ind = np.random.choice(len(df_), 2_000, replace=False)
X = df_[flux_cols].to_numpy()[ind]
y = df_['z_spec'].to_numpy()[ind]
for xk, yk in zip(X, y):
    color = cmap(norm(yk))
    plt.plot(x_, x_ * xk, color=color, rasterized=True, alpha=0.4, lw=0.5)
plt.yscale('log')
sm = plt.cm.ScalarMappable(cmap=cmap, norm=norm)
sm.set_array([])
cbar = plt.colorbar(sm, ax=plt.gca())
cbar.set_label('z')
plt.title('z spec')

plt.xlabel(r'$\lambda$ [nm]')

plt.tight_layout()
[figure: λ·flux vs. λ for 2,000 random sources, colored by redshift — left: z_peak sample, right: z_spec sample]
remove = ['field', 'ra', 'dec', 'x', 'y',
          'faper_F140W', 'eaper_F140W', 'faper_F160W', 'eaper_F160W',
          'tot_cor', 'kron_radius', 'a_image', 'b_image',
          'flux_radius', 'fwhm_image', 'flags', 'f140w_flag', 'star_flag',
          'use_phot', 'near_star', 'nexp_f125w', 'nexp_f140w', 'nexp_f160w',
          'lmass']

flux_cols = [name for name in df.columns if name.startswith('f_F')]

positive_flux = ' & '.join([f'({col:s} > 0)' for col in flux_cols])

select_zphot = '(star_flag == 0) & (use_phot == 1) & (z_peak > 0) & (Av >= 0)'
select_zspec = '(star_flag == 0) & (z_spec > 0) & (Av >= 0)'
select = df.eval(select_zspec + '&' + positive_flux)
df_ = df[select].reset_index()
target = 'z_spec'
# features = [col for col in df_ if col not in remove + ['z_spec', 'z_peak'] ]
features = flux_cols + ['Av']

Markdown(f"""
**features** {features}

**target**: {target} """)

features [‘f_F606W’, ‘f_F814W’, ‘f_F125W’, ‘f_F140W’, ‘f_F160W’, ‘Av’]

target: z_spec

# keep Av as-is and take log10 of the fluxes (renamed f_ -> m_, since log-fluxes behave like magnitudes up to a scale)
df_x = pd.concat([df_[[fk for fk in features if fk not in flux_cols]],
                  np.log10(df_[flux_cols]).rename(columns={k: k.replace('f_', 'm_') for k in flux_cols})],
                 axis=1)
df_y = df_[target]
df_x
Av m_F606W m_F814W m_F125W m_F140W m_F160W
0 0.1 0.797247 0.975064 1.086893 1.152105 1.172282
1 1.2 0.686877 0.881835 1.191004 1.271934 1.350558
2 0.6 0.997478 1.173245 1.281374 1.307410 1.333185
3 0.1 0.537932 1.090822 1.508880 1.572964 1.636698
4 0.7 1.157608 1.335598 1.491544 1.531849 1.576698
... ... ... ... ... ... ...
2287 0.6 0.310247 1.097466 1.719588 1.794063 1.881316
2288 0.6 2.120245 2.233580 2.333165 2.319252 2.425632
2289 0.6 0.960361 1.216166 1.435717 1.483602 1.520156
2290 1.4 2.104111 2.280373 2.507032 2.561006 2.613112
2291 0.7 1.553167 1.690878 1.855410 1.884246 1.910507

2292 rows × 6 columns

columns = list(df_x.columns)

wrap = 6

nlines = (len(columns) + 1) // wrap +  int((len(columns) + 1) % wrap > 0)

plt.figure(figsize=(15, nlines * 4))
for e, colname in enumerate(columns, 1):
    plt.subplot(nlines, wrap, e)
    plt.hist(df_x[colname], bins=43, log=True)
    plt.xlabel(colname)

plt.subplot(nlines, wrap, e + 1)
plt.hist(df_y, bins=43, log=True, color='C1')
plt.xlabel(target)

plt.tight_layout()
[figure: log-scale histograms of each feature (Av, m_*) and of the target z_spec]

Divide the data into train, test and validation sets#

Note that many libraries expect plain numpy arrays, which one would obtain with:

X = df[features].values
y = df[target].values

However, scikit-learn accepts pandas objects directly, so we do not need this conversion in this particular exercise.
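For instance (a quick check, not part of the analysis below), an estimator fitted on a DataFrame records the column names:

from sklearn.linear_model import LinearRegression

# fitting directly on DataFrames; the feature names are retained
LinearRegression().fit(df_x, df_y).feature_names_in_
# array(['Av', 'm_F606W', 'm_F814W', 'm_F125W', 'm_F140W', 'm_F160W'], dtype=object)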

Divide the data into train, validation and test sets. We will use the following definitions:

  • Training: The data used to update model parameters (e.g., coefficients or matrix element values).

  • Validation: The data used for model selection (for instance, we might change hyperparameters of a model based on the validation metrics).

  • Testing: The data used to make final predictions, and possibly evaluate the final model score.
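Below we keep a single train/test split, with the test set doubling as our validation set. For completeness, a full three-way split would look like this sketch (the 60/20/20 fractions are illustrative):

from sklearn.model_selection import train_test_split

# hold out 20% for testing, then 25% of the remainder (20% overall) for validation
X_rest, X_te, y_rest, y_te = train_test_split(df_x, df_y, test_size=0.2, random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(X_rest, y_rest, test_size=0.25, random_state=42)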

from sklearn.model_selection import train_test_split

# reserve 70% of the data for training and hold out the remaining 30% for testing
X_train, X_test, y_train, y_test = train_test_split(df_x, df_y, test_size=0.3, random_state=42)

Explore regression models#

Here we explore a few options for regressor models.

Linear Regression#

Let's start with a simple one.

from sklearn.linear_model import LinearRegression
lin = LinearRegression()
lin.fit(X_train, y_train)
LinearRegression()

To quantify the performance of the model, we will apply it to the held-out test set and compare the predicted values with the true values by computing the mean squared error:
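For reference, with $y_i$ the true redshifts and $\hat{y}_i$ the predictions over $N$ points,

$$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2 .$$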

from sklearn.metrics import mean_squared_error

y_predict = lin.predict(X_test)
y_predict0 = lin.predict(X_train)
mse_train = mean_squared_error(y_train, y_predict0)
mse_test = mean_squared_error(y_test, y_predict)

fig, ax = plt.subplots(1, 1, figsize=(4, 4))
ax.scatter(y_test, y_predict, alpha=0.2, label=f'test mse={mse_test:0.4f}')
ax.scatter(y_train, y_predict0, alpha=0.2, label=f'train mse={mse_train:0.4f}', zorder=-1)
ax.plot(ax.get_xlim(), ax.get_xlim(), zorder=-1, color='0.5')
ax.set_aspect('equal')
ax.set_xlim(-0.1, 5)
ax.set_ylim(-0.1, 5)
ax.grid()
ax.set_xlabel('Truth (y_test)')
ax.set_ylabel('Prediction (y_predict)')
plt.legend(loc='upper left')
[figure: predicted vs. true redshift for the linear model, train and test sets]

The MSE is not great, which is not surprising given the simplicity of our model. Maybe we can do better by changing the model? Let's first inspect the fitted coefficients:

importances = lin.coef_
labels = lin.feature_names_in_
argsort = np.argsort(importances)
fig = plt.figure(figsize=[10, len(labels) * 0.5])
ax = fig.add_subplot(111)
ax.barh(np.arange(len(labels)), importances[argsort], 
        align='center', 
        height=0.5, 
        tick_label=labels[argsort])  # sort the labels together with the coefficients
ax.set_xlabel("Lin. Coeffs")
[figure: linear-regression coefficients per feature]
Next, let's enrich the linear model with polynomial features and a sparse LassoLarsIC fit, assembled into a Pipeline:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LassoLarsIC

pipe = Pipeline([
    ('polynomial', PolynomialFeatures(degree=3)),
    ('scaler', StandardScaler()),
    ('regressor', LassoLarsIC())
])

pipe.fit(X_train, y_train)
y_predict = pipe.predict(X_test)
y_predict0 = pipe.predict(X_train)
mse_train = mean_squared_error(y_train, y_predict0)
mse_test = mean_squared_error(y_test, y_predict)

fig, ax = plt.subplots(1, 1, figsize=(4, 4))
ax.scatter(y_test, y_predict, alpha=0.2, label=f'test mse={mse_test:0.4f}')
ax.scatter(y_train, y_predict0, alpha=0.2, label=f'train mse={mse_train:0.4f}', zorder=-1)
ax.plot(ax.get_xlim(), ax.get_xlim(), zorder=-1, color='0.5')
ax.set_aspect('equal')
ax.set_xlim(-0.1, 5)
ax.set_ylim(-0.1, 5)
ax.grid()
ax.set_xlabel('Truth (y_test)')
ax.set_ylabel('Prediction (y_predict)')
plt.legend(loc='best')
[figure: predicted vs. true redshift for the polynomial pipeline, train and test sets]
importances = pipe['regressor'].coef_
labels = pipe['polynomial'].get_feature_names_out()
argsort = np.argsort(np.abs(importances))[-10:]

importances = importances[argsort]
labels = labels[argsort]

fig = plt.figure(figsize=[10, 10])
ax = fig.add_subplot(111)
ax.barh(np.arange(len(labels)), importances, 
        align='center', 
        height=0.5, 
        tick_label=labels)
ax.set_xlabel("Lin. Coeffs")
[figure: the ten polynomial features with the largest |coefficient|]

Other models#

With sklearn, it is very easy to swap one model for another.

Random Forest#

Using sklearn, we can very easily construct a tree-based model for regressing redshift from the photometric catalog features. A decision tree is composed of a series of if-else decision steps; the number of steps and the type of decision at each step are determined by training the algorithm with supervision. A random forest averages an ensemble of such trees, each trained on a bootstrapped subsample of the data. Here we use the RandomForestRegressor from sklearn.
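For contrast (a quick sketch, not used in the rest of the notebook), a single DecisionTreeRegressor can be fitted the same way:

from sklearn.tree import DecisionTreeRegressor

tree = DecisionTreeRegressor(min_samples_split=15, min_samples_leaf=5, max_depth=15)
tree.fit(X_train, y_train)
# compare this test MSE with the random forest below
print('single-tree test MSE:', mean_squared_error(y_test, tree.predict(X_test)))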

from sklearn.ensemble import RandomForestRegressor

# regularization parameters for the individual trees
params = {
    "min_samples_split": 15,
    "min_samples_leaf": 5,
    "max_depth": 15,
}

# Initialize the model
rf = RandomForestRegressor(**params)

rf.fit(X_train, y_train)
RandomForestRegressor(max_depth=15, min_samples_leaf=5, min_samples_split=15)
y_predict = rf.predict(X_test)
y_predict0 = rf.predict(X_train)
mse_train = mean_squared_error(y_train, y_predict0)
mse_test = mean_squared_error(y_test, y_predict)

fig, ax = plt.subplots(1, 1, figsize=(4, 4))
ax.scatter(y_test, y_predict, alpha=0.2, label=f'test mse={mse_test:0.4f}')
ax.scatter(y_train, y_predict0, alpha=0.2, label=f'train mse={mse_train:0.4f}', zorder=-1)
ax.plot(ax.get_xlim(), ax.get_xlim(), zorder=-1, color='0.5')
ax.set_aspect('equal')
ax.set_xlim(-0.1, 5)
ax.set_ylim(-0.1, 5)
ax.set_xlabel('Truth (y_test)')
ax.set_ylabel('Prediction (y_predict)')
plt.legend(loc='best')
[figure: predicted vs. true redshift for the random forest, train and test sets]

The results are much improved by the random forest!

However, we must be cautious. It is possible that this model suffers from over-fitting. To visualize overfitting, we can compare the mean squared error (MSE) for models of increasing max_depth on the training and testing sets:

max_depths = np.arange(1, 20, 2).astype(int)

train_mse = []
test_mse = []

params = {
    "min_samples_split": 15,
    "min_samples_leaf": 5,
}

for depth in max_depths:

    rff_ = RandomForestRegressor(max_depth=depth, **params)
    rff_.fit(X_train, y_train)
    
    y_predict_train = rff_.predict(X_train)
    y_predict_test = rff_.predict(X_test)
    
    train_mse.append(mean_squared_error(y_train, y_predict_train))
    test_mse.append(mean_squared_error(y_test, y_predict_test))

    
fig, ax = plt.subplots(1, 1)

ax.plot(max_depths, train_mse, color='blue', label='Training Set')
ax.plot(max_depths, test_mse, color='orange', label='Test Set')
ax.set_xlabel('max depth')
ax.set_ylabel('MSE')
ax.legend()
[figure: MSE vs. max depth on the training and test sets]

Beyond a max_depth of ~10, the MSE on the training set declines while the MSE on the test set flattens out, suggesting some amount of over-fitting.

To explore further, we will examine how well our model performance (here quantified with MSE) generalizes, using k-fold cross-validation via sklearn. In practice, the X and y datasets are split into k “folds”, and over k iterations the model is trained on k-1 folds, with the remaining fold used as a test set to compute the performance (i.e., MSE).
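Roughly, this amounts to the following sketch using KFold directly (the cross_validate helper below handles scoring and bookkeeping for us):

from sklearn.base import clone
from sklearn.model_selection import KFold

scores = []
for train_idx, test_idx in KFold(n_splits=5).split(df_x):
    # fit a fresh clone of the model on k-1 folds, score on the held-out fold
    model_k = clone(rf).fit(df_x.iloc[train_idx], df_y.iloc[train_idx])
    scores.append(mean_squared_error(df_y.iloc[test_idx],
                                     model_k.predict(df_x.iloc[test_idx])))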

from sklearn.model_selection import cross_validate

cv = cross_validate(
    estimator=rf, 
    X=df_x, 
    y=df_y, 
    cv=5, # number of folds
    scoring='neg_mean_squared_error'
)

print(f'Cross-validated score: {cv["test_score"].mean():.3f} +/- {cv["test_score"].std():.3f}')
Cross-validated score: -0.156 +/- 0.089

The previous MSE is consistent with the cross-validated MSE, suggesting that the model is not significantly over-fitting.

Next, we’ll observe which features are most important to the model predictions.

importances = rf.feature_importances_
argsort = np.argsort(importances)
fig = plt.figure(figsize=[10, 10])
ax = fig.add_subplot(111)
ax.barh(np.arange(len(df_x.columns)), importances[argsort], 
        align='center', 
        height=0.5, 
        tick_label=np.array(df_x.columns)[argsort])
ax.set_xlabel("Feature importance")
[figure: random-forest feature importances, sorted]

Gradient boosted trees#

from sklearn.ensemble import HistGradientBoostingRegressor


# Initialize the model: histogram-based gradient boosting (similar to LightGBM)
et = HistGradientBoostingRegressor()

et.fit(X_train, y_train)
HistGradientBoostingRegressor()
y_predict = et.predict(X_test)
y_predict0 = et.predict(X_train)
mse_train = mean_squared_error(y_train, y_predict0)
mse_test = mean_squared_error(y_test, y_predict)

fig, ax = plt.subplots(1, 1, figsize=(4, 4))
ax.scatter(y_test, y_predict, alpha=0.2, label=f'test mse={mse_test:0.4f}')
ax.scatter(y_train, y_predict0, alpha=0.2, label=f'train mse={mse_train:0.4f}', zorder=-1)
ax.plot(ax.get_xlim(), ax.get_xlim(), zorder=-1, color='0.5')
ax.set_aspect('equal')
ax.set_xlim(-0.1, 5)
ax.set_ylim(-0.1, 5)
ax.set_xlabel('Truth (y_test)')
ax.set_ylabel('Prediction (y_predict)')
plt.legend(loc='best')
[figure: predicted vs. true redshift for the gradient-boosted trees, train and test sets]
cv = cross_validate(
    estimator=et, 
    X=df_x, 
    y=df_y, 
    cv=5, # number of folds
    scoring='neg_mean_squared_error'
)

print(f'Cross-validated score: {cv["test_score"].mean():.3f} +/- {cv["test_score"].std():.3f}')
Cross-validated score: -0.154 +/- 0.088
pipe2 = Pipeline([
    ('polynomial', PolynomialFeatures(degree=3)),
    ('scaler', StandardScaler()),
    ('regressor', HistGradientBoostingRegressor())
])

pipe2.fit(X_train, y_train)
Pipeline(steps=[('polynomial', PolynomialFeatures(degree=3)),
                ('scaler', StandardScaler()),
                ('regressor', HistGradientBoostingRegressor())])
y_predict = pipe2.predict(X_test)
y_predict0 = pipe2.predict(X_train)
mse_train = mean_squared_error(y_train, y_predict0)
mse_test = mean_squared_error(y_test, y_predict)

fig, ax = plt.subplots(1, 1, figsize=(4, 4))
ax.scatter(y_test, y_predict, alpha=0.2, label=f'test mse={mse_test:0.4f}')
ax.scatter(y_train, y_predict0, alpha=0.2, label=f'train mse={mse_train:0.4f}', zorder=-1)
ax.plot(ax.get_xlim(), ax.get_xlim(), zorder=-1, color='0.5')
ax.set_aspect('equal')
ax.set_xlim(-0.1, 5)
ax.set_ylim(-0.1, 5)
ax.set_xlabel('Truth (y_test)')
ax.set_ylabel('Prediction (y_predict)')
plt.legend(loc='best')
[figure: predicted vs. true redshift for the polynomial + gradient-boosting pipeline, train and test sets]

Check the models against each other and a reference#

# dummy baseline: sample redshifts from the training-set distribution
# via inverse-transform sampling of the empirical CDF
n, bins = np.histogram(y_train, bins=256)
cdf = n.cumsum()
yval = np.interp(np.random.uniform(0, 1, len(y_test)),
                 cdf / cdf[-1], bins[:-1])

y_values = [yval[:], 
            lin.predict(X_test),
            pipe.predict(X_test),
            rf.predict(X_test),
            et.predict(X_test),
            pipe2.predict(X_test),
            df_['z_peak'].loc[y_test.index]
]

names = ['dummy', 'linear', 'polynomial', 'random forest', 'lightGBM', 'poly+lightGBM', 'Skelton et al. 2014']

wrap = 4

nlines = len(y_values) // wrap + int(len(y_values) % wrap > 0)

fig, axes = plt.subplots(nlines, wrap, figsize=(15, 4 * nlines))

for k, (ax, yval) in enumerate(zip(np.ravel(axes), y_values)):
    color = 'C0'
    if k == len(names) - 1:
        color = 'C1'
    if k == 0:
        color='C3'
    ax.scatter(y_test, yval, alpha=0.2, color=color)
    ax.set_aspect('equal')
    ax.set_xlim(-0.1, 5)
    ax.set_ylim(-0.1, 5)
    ax.plot(ax.get_xlim(), ax.get_xlim(), zorder=-1, color='0.5')
    ax.set_xlabel('Truth (z_spec)')
    ax.set_ylabel('prediction (y_predict)')
    ax.set_title(names[k])
    rmse = mean_squared_error(y_test, yval, squared=False)
    ax.text(0.01, 0.9, f'RMSE = {rmse:.4f}', transform=ax.transAxes)
plt.setp(np.ravel(axes)[k+1:], visible=False)

plt.tight_layout()
[figure: predicted vs. true redshift per model (dummy, linear, polynomial, random forest, lightGBM, poly+lightGBM, Skelton et al. 2014), each annotated with its RMSE]

Going further#

  • What if you include galaxy colors (e.g., F125W-F140W, F140W-F160W, F606W-F125W)? Do the model results improve? Can you think of other features to include? (A sketch covering this and the next question follows the list.)

  • Can you tune the model parameters with a grid search?

  • What if you try other models from sklearn?
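As a starting point for the first two questions, here is a sketch. Since our features are log-fluxes, a color is a difference of m_ columns (up to a factor of -2.5); the c_ column names and the parameter grid are purely illustrative:

from sklearn.model_selection import GridSearchCV

# colors as differences of log-fluxes (proportional to magnitude differences)
df_xc = df_x.copy()
df_xc['c_F125W_F140W'] = df_x['m_F125W'] - df_x['m_F140W']
df_xc['c_F140W_F160W'] = df_x['m_F140W'] - df_x['m_F160W']
df_xc['c_F606W_F125W'] = df_x['m_F606W'] - df_x['m_F125W']

# tune the random forest over an (illustrative) hyperparameter grid
grid = GridSearchCV(RandomForestRegressor(),
                    param_grid={'max_depth': [5, 10, 15],
                                'min_samples_leaf': [2, 5, 10]},
                    scoring='neg_mean_squared_error', cv=5)
grid.fit(df_xc, df_y)
print(grid.best_params_, grid.best_score_)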

Neural Networks#

from typing import Sequence
import torch

class RegressionMLP(torch.nn.Module):
    """A small fully connected network (MLP) for regression."""
    def __init__(self, n_input: int, 
                 n_hidden: Sequence[int], 
                 n_output: int):
        super().__init__()
    
        self.sequence = torch.nn.Sequential()
        # add input layer and its nonlinearity
        self.sequence.add_module("input", 
                                 torch.nn.Linear(n_input, n_hidden[0]))
        self.sequence.add_module("activation_in", torch.nn.SELU())
        # set hidden layers
        for e, (n1, n2) in enumerate(zip(n_hidden[:-1], n_hidden[1:])):
            self.sequence.add_module(f"hidden_{e:d}", 
                                     torch.nn.Linear(n1, n2))
            self.sequence.add_module(f"activation_{e:d}", 
                                     torch.nn.SELU())
        # set output layer (no activation: the regression output is unbounded)
        self.sequence.add_module("output",  
                                 torch.nn.Linear(n_hidden[-1], n_output))

    def forward(self, x: torch.Tensor) -> torch.Tensor:

        return self.sequence(x)

model = RegressionMLP(len(features), [32] * 4, 1)
optimizer = torch.optim.RAdam(model.parameters(), lr=1e-3, weight_decay=1e-4)
criterion = torch.nn.MSELoss()
epochs = 6_000

# always be careful to normalize your data.
x_scaler = StandardScaler()
X_scaled = np.squeeze(x_scaler.fit_transform(X_train))
inputs = torch.from_numpy(X_scaled.astype('float32'))
labels = torch.from_numpy(y_train.to_numpy().astype('float32').reshape(-1, 1))
inputs_test = torch.from_numpy(x_scaler.transform(X_test).astype('float32'))
labels_test = torch.from_numpy(y_test.to_numpy().astype('float32').reshape(-1, 1))

# monitoring
loss_list = []
history = [] 

for epoch in range(epochs):
    # Clear gradient buffers: we don't want gradients from
    # previous epochs to carry forward and accumulate
    optimizer.zero_grad()

    # get output from the model, given the inputs
    outputs = model(inputs)
    # evaluate the test set without building a gradient graph
    with torch.no_grad():
        outputs_test = model(inputs_test)

    # get loss for the predicted output
    loss = criterion(outputs, labels)
    loss_test = criterion(outputs_test, labels_test)
    # get gradients w.r.t to parameters
    loss.backward()

    # update parameters
    optimizer.step()

    if epoch % 100 == 0:
        print('epoch {}, loss {:2.4g} {:2.4g}'.format(epoch, loss.item(), loss_test.item()))
    history += [[epoch, float(loss), float(loss_test)]]    
    #scheduler.step()

print(f"Epoch= {epoch},\t loss = {loss:2.4g} {loss_test:2.4g}")
print('done.')
epoch 0, loss 1.256 1.194
epoch 100, loss 0.2532 0.2278
epoch 200, loss 0.2139 0.1975
epoch 300, loss 0.2004 0.1835
epoch 400, loss 0.1863 0.1683
epoch 500, loss 0.1766 0.1581
epoch 600, loss 0.1671 0.1524
epoch 700, loss 0.1582 0.1492
epoch 800, loss 0.1512 0.148
epoch 900, loss 0.1441 0.1448
epoch 1000, loss 0.135 0.1381
epoch 1100, loss 0.1252 0.1304
epoch 1200, loss 0.1153 0.1273
epoch 1300, loss 0.1071 0.1248
epoch 1400, loss 0.09918 0.1177
epoch 1500, loss 0.09319 0.1151
epoch 1600, loss 0.08701 0.1111
epoch 1700, loss 0.08287 0.1121
epoch 1800, loss 0.07684 0.1062
epoch 1900, loss 0.0748 0.102
epoch 2000, loss 0.07119 0.1002
epoch 2100, loss 0.0688 0.09899
epoch 2200, loss 0.06734 0.09871
epoch 2300, loss 0.06555 0.09761
epoch 2400, loss 0.06401 0.09789
epoch 2500, loss 0.06564 0.09763
epoch 2600, loss 0.06504 0.1043
epoch 2700, loss 0.06077 0.09725
epoch 2800, loss 0.06038 0.09661
epoch 2900, loss 0.06069 0.09704
epoch 3000, loss 0.05867 0.09635
epoch 3100, loss 0.05673 0.09808
epoch 3200, loss 0.05559 0.09761
epoch 3300, loss 0.05544 0.0981
epoch 3400, loss 0.05458 0.09742
epoch 3500, loss 0.0554 0.09712
epoch 3600, loss 0.05304 0.09779
epoch 3700, loss 0.05986 0.1105
epoch 3800, loss 0.05138 0.09794
epoch 3900, loss 0.05086 0.09789
epoch 4000, loss 0.05482 0.09978
epoch 4100, loss 0.05071 0.09806
epoch 4200, loss 0.05018 0.09988
epoch 4300, loss 0.04906 0.09885
epoch 4400, loss 0.04861 0.09787
epoch 4500, loss 0.04919 0.1017
epoch 4600, loss 0.0478 0.09805
epoch 4700, loss 0.0476 0.101
epoch 4800, loss 0.04932 0.1013
epoch 4900, loss 0.04768 0.09877
epoch 5000, loss 0.04649 0.09809
epoch 5100, loss 0.04653 0.09777
epoch 5200, loss 0.04514 0.09801
epoch 5300, loss 0.04489 0.09818
epoch 5400, loss 0.04561 0.09733
epoch 5500, loss 0.04413 0.09794
epoch 5600, loss 0.04747 0.1043
epoch 5700, loss 0.04582 0.1024
epoch 5800, loss 0.04401 0.09853
epoch 5900, loss 0.04362 0.0974
Epoch= 5999,	 loss = 0.04279 0.09795
done.
plt.loglog([e for (e, l, _) in history], 
           [float(l) for (e, l, _) in history], '-', label='train')
plt.loglog([e for (e, _, l) in history], 
           [float(l) for (e, _, l) in history], '-', label='test')
plt.legend()
plt.xlabel('epoch')
plt.ylabel('loss MSE')
[figure: training and test loss (MSE) vs. epoch, log-log scale]
y_predict = model(
    torch.from_numpy(x_scaler.transform(X_test).astype('float32'))
    ).cpu().detach().numpy()  
y_predict0 = model(
    torch.from_numpy(x_scaler.transform(X_train).astype('float32'))
    ).cpu().detach().numpy()  
rmse = mean_squared_error(y_test, y_predict, squared=False)
print(f'RMSE = {rmse:.4f}')

fig, ax = plt.subplots(1, 1, figsize=(4, 4))
ax.scatter(y_test, y_predict, alpha=0.2, label='test')
ax.scatter(y_train, y_predict0, alpha=0.2, label='train', zorder=-1)
ax.plot(ax.get_xlim(), ax.get_xlim(), zorder=-1, color='0.5')
ax.set_aspect('equal')
ax.set_xlim(-0.1, 5)
ax.set_ylim(-0.1, 5)
ax.set_xlabel('Truth (y_test)')
ax.set_ylabel('Prediction (y_predict)')
plt.legend(loc='best')
ax.text(0.9, 0.9, f"RMSE={rmse:0.4f}", transform=ax.transAxes, ha='right');
RMSE = 0.3148
[figure: predicted vs. true redshift for the neural network, train and test sets]