1.022 Final Project: Modeling the nation-wide spread of COVID-19
================================================================

We have carefully crafted the final project to provide you with an
opportunity to gain experience with a real-world application of
networks. As a particular case of interest, we will design, evaluate,
and refine an epidemic model to predict the **total number of covid
cases in the US** over time.

.. _collecting-data:

===============================================
1. Collecting and plotting data on covid cases
===============================================

As the first step, we collect historical and current data on the total
number of covid cases across the nation via the *nytimes covid-19 bot*
(the ``nytimes`` helper module imported below). Run the code below to:

- get the daily record of the cumulative counts of covid cases at the
  state level
- process the gathered data to get daily and weekly counts of new cases
  at both the state and national levels

.. jupyter-execute::

   import numpy as np
   import pandas as pd
   import itertools
   import pickle
   from scipy.linalg import norm
   from scipy.linalg import expm
   from scipy.optimize import minimize
   import os
   from timeit import default_timer as timer
   import matplotlib.pyplot as plt
   plt.style.use('seaborn')

   import nytimes

   # some basic national demographics (including the population of each
   # state, to be used in normalizing the covid counts)
   state_geo = pd.read_csv('states_demographics.csv',
                           dtype={'state_fips': str}).set_index('state_id')

   raw_states_df, _ = nytimes.get_nyt_data()
   state_cases = nytimes.convert_state_df_to_ts(
       raw_states_df, quantity='cases').drop(columns=['state']).T
   state_cases.index = pd.to_datetime(state_cases.index)
   state_cases = state_cases[state_geo['state_name']]
   state_cases.columns = state_geo.index.copy()

   # daily new cases from the cumulative counts
   state_cases_daily = state_cases.diff()
   t_last_na = state_cases_daily.isna().any(axis=1).cumsum().idxmax()
   state_cases_daily.dropna(inplace=True)

   # weekly new cases, normalized by state population
   state_cases_weekly = 7*state_cases_daily.resample(
       '7D').mean().divide(state_geo['population'])
   cumul_offset = state_cases.loc[t_last_na].divide(state_geo['population'])
   state_cases_cumulative = state_cases_weekly.cumsum() + cumul_offset

   # national aggregates, normalized by the US population
   nation_cases_daily = state_cases_daily.sum(axis=1)
   nation_cases_weekly = 7*nation_cases_daily.resample('7D').mean()\
       /state_geo['population'].sum()
   nation_cumul_offset = state_cases.loc[t_last_na].sum()\
       /state_geo['population'].sum()
   nation_cases_cumulative = nation_cases_weekly.cumsum()\
       + nation_cumul_offset

.. jupyter-execute::
   :hide-code:

   t_clip = pd.Timestamp('2020-11-25')
   nation_cases_cumulative = nation_cases_cumulative[:t_clip]
   nation_cases_weekly = nation_cases_weekly[:t_clip]
   state_cases_cumulative = state_cases_cumulative[:t_clip]
   state_cases_weekly = state_cases_weekly[:t_clip]

The processed data that we will use in our design consists of the
weekly new covid cases and the cumulative number of cases, gathered in
the DataFrames ``state_cases_weekly`` and ``state_cases_cumulative``
for the state-level data, and in the two Series
``nation_cases_weekly`` and ``nation_cases_cumulative`` for the
country-level data, all starting on March 18 with a 7-day (``7D``)
frequency.
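To see concretely what the ``diff``/``resample`` pipeline above does, here is
a minimal, self-contained sketch on synthetic cumulative counts (the
two-state setup and all numbers are made up purely for illustration):

.. code-block:: python

   import numpy as np
   import pandas as pd

   # synthetic cumulative counts for two hypothetical states over 14 days
   dates = pd.date_range('2020-03-11', periods=14, freq='D')
   cumulative = pd.DataFrame(
       {'AA': np.arange(14)**2, 'BB': 3*np.arange(14)},
       index=dates)
   population = pd.Series({'AA': 1e6, 'BB': 2e6})

   daily = cumulative.diff().dropna()      # new cases per day
   # 7*mean over each 7-day bin == total new cases in that week
   weekly = 7*daily.resample('7D').mean()
   weekly_normalized = weekly.divide(population)  # fraction of population
   print(weekly_normalized)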
Here is a snapshot of the state-level weekly count of new covid cases
(``state_cases_weekly``):

.. csv-table::
   :header: state_id, AL, AZ, AR, CA
   :widths: 20, 20, 20, 20, 20

   2020-03-18, 0.000041, 0.000042, 0.000069, 0.000048
   2020-03-25, 0.000154, 0.000134, 0.000110, 0.000150
   2020-04-01, 0.000244, 0.000176, 0.000143, 0.000227

**Both the weekly and the cumulative number of cases are normalized by
state population for the state-level data, and by the US population for
the national-level data.** From now on and in the rest of this
manuscript, *number of cases* by default refers to the normalized value
(a number between 0 and 1), unless explicitly stated otherwise.

.. _task1.1:

.. topic:: Task 1.1

   Use the collected data to plot: (i) the weekly count of new covid
   cases in the US, and (ii) the cumulative number of cases in the US.

.. jupyter-execute::

   fig, axes = plt.subplots(2, 1)
   iter_axes = itertools.cycle(fig.axes)

   ax = next(iter_axes)
   ax.cla()
   ax.plot(nation_cases_weekly,
           label='national cases weekly', linewidth=4)
   ax.legend()

   ax = next(iter_axes)
   ax.cla()
   ax.plot(nation_cases_cumulative,
           label='national cases cumulative', linewidth=4)
   ax.legend()

.. _basic-SIR:

===============================================
2. Fitting a basic SIR model to the covid data
===============================================

.. _task2.1:

.. topic:: Task 2.1

   - Show that in the basic SIR model, as long as the size of the
     infection is very small (e.g., in the early stages of the spread),
     we can approximate the dynamics of the spread by:

     .. math::

        \begin{align*}
        \dot{S}(t) &= -\beta~I(t)\\
        \dot{I}(t) &= (\beta - \gamma)~I(t)
        \end{align*}

   - Use this simplified dynamics to show that:

     .. math::

        S(t) \approx S(0) - \frac{\beta}{\beta-\gamma}\big(e^{(\beta-\gamma)t}-1\big)~I(0)

Observe that the following three quantities are identical: the
(normalized) cumulative number of cases at time :math:`t`
:math:`\equiv` :math:`1-S(t)` :math:`\equiv` the fraction of the
population that is either infected or recovered/removed by time
:math:`t` (the *IR* population).

The function ``SIR`` implemented below simulates :math:`T` steps of the
SIR model based on the above approximation. Inputs to the ``SIR``
function:

- ``beta``: the average infection/contact rate in the population
- ``gamma``: the average recovery rate for each infected case
- ``IR_init``: the initial size of the infected-or-recovered/removed
  cases (:math:`I(0)+R(0)`)
- ``I_init``: the initial size of the infected cases (:math:`I(0)`)
- ``T``: the number of time steps to run the model. **Each step here
  represents a week** (7D)

The function will then output:

- ``IR_1_to_T``: the cumulative number of covid cases at
  :math:`t=1,\ldots,T` as predicted by the model

.. jupyter-execute::

   def SIR(beta, gamma, IR_init, I_init, T):
       # closed-form early-stage approximation derived in Task 2.1
       IR_1_to_T = (np.exp((beta-gamma)*np.arange(1, T+1))-1)*\
           beta*I_init/(beta-gamma) + IR_init
       return IR_1_to_T
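As a quick sanity check, here is a minimal usage sketch of ``SIR``
(assuming the function above has been defined); the parameter values
below are made-up illustrations, not fitted quantities. In the
early-growth regime the output should grow roughly exponentially:

.. code-block:: python

   # illustrative values only: beta, gamma, and the initial sizes here
   # are assumptions for this sketch
   beta_demo, gamma_demo = 0.8, 0.4
   IR0 = I0 = 1e-4   # 0.01% of the population initially infected
   print(SIR(beta_demo, gamma_demo, IR0, I0, T=5))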
The next step is to tune the parameters of the model, namely
:math:`\beta` and :math:`\gamma`, using the data we prepared earlier.
Let us **choose March 18 as** :math:`t=0`. Use the cumulative number of
cases in the US on March 18 (the top entry of
``nation_cases_cumulative``) as ``IR_init``. We further assume
``I_init`` = ``IR_init`` (i.e., :math:`\small R(0)\approx 0`). This can
be justified by noting that on March 18 the spread was at its very
early stages, hence only a small fraction of infected cases had
recovered by then.

The goal is to choose :math:`\beta` and :math:`\gamma` such that the
number of cases predicted by the model *stays close* to that observed
in the data. The standard approach is to choose a *loss function*
measuring how far the predictions are from the observed data, and then
tune the parameters so that they minimize the loss. Here, we search for
:math:`\beta` and :math:`\gamma` minimizing the sum of the squares of
the prediction errors for the US cumulative number of cases, over a
period of :math:`T=16` weeks (from March 25 to July 08):

.. math::

   {\underset{\beta,\,\gamma}{\rm minimize}} \sum_{t=1}^T \big(IR_{\rm predicted}(t)-IR_{\rm data}(t)\big)^2

Note that the March 18 data is already used in initializing the model
(``IR_init``). Here is the code implementing this loss function:

.. jupyter-execute::

   def SIR_LOSS(params, IR_init, I_init, observed_cumulative_cases):
       beta, gamma = params
       T = len(observed_cumulative_cases)
       predicted_IR = SIR(beta, gamma, IR_init, I_init, T)
       loss = norm(observed_cumulative_cases - predicted_IR)**2
       return loss

Inputs and outputs of the ``SIR_LOSS`` function:

- ``params``: passes the values of :math:`\beta` and :math:`\gamma` to
  the loss function
- ``IR_init``: the initial size of the infected-or-recovered cases
  (:math:`I(0)+R(0)`)
- ``I_init``: the initial size of the infected cases (:math:`I(0)`)
- ``observed_cumulative_cases``: the cumulative number of covid cases
  in the US as observed in the data

The function will then output:

- ``loss``: the sum of squares of the prediction errors in the
  cumulative number of cases in the US as predicted by the model

.. _task2.2:

.. topic:: Task 2.2

   - Use the function ``minimize`` from the *scipy.optimize* package to
     find values of :math:`\beta` and :math:`\gamma` minimizing the loss
     function ``SIR_LOSS`` over a period of :math:`T=16` weeks, from
     March 25 to July 08. (Choose data from ``nation_cases_cumulative``
     for this period to pass as ``observed_cumulative_cases`` to the
     loss function.)
   - Pass the values of :math:`\beta` and :math:`\gamma` you obtain to
     the ``SIR`` function to predict the cumulative number of covid
     cases in the US for 36 weeks starting from March 18.
   - Plot the predictions together with the observed data for both
     (i) the cumulative number of cases and (ii) weekly new cases
     (use ``np.diff()`` to obtain the latter).

.. jupyter-execute::

   n_train = 16
   IR_init = I_init = nation_cases_cumulative[0]
   observed_cumul_cases = nation_cases_cumulative[1:n_train+1].values

   params_init = [0.1, 0.01]
   res = minimize(SIR_LOSS, params_init,
                  args=(IR_init, I_init, observed_cumul_cases))
   beta_SIR, gamma_SIR = res['x']
   print('success: {}, beta: {:.3f}, gamma: {:.3f}'.format(
       res['success'], beta_SIR, gamma_SIR))

.. jupyter-execute::

   # predict over the entire data range and visualize
   T = len(nation_cases_cumulative[1:])
   predicted_cumul_cases = SIR(beta_SIR, gamma_SIR,
                               nation_cases_cumulative[0],
                               nation_cases_cumulative[0], T)
   predicted_cumul_cases = \
       np.append(nation_cases_cumulative[0], predicted_cumul_cases)
   predicted_weekly_cases = np.diff(predicted_cumul_cases)

   fig, axes = plt.subplots(2, 1)
   iter_axes = itertools.cycle(fig.axes)

   ax = next(iter_axes); ax.cla()
   ax.plot(nation_cases_weekly[1:],
           label='{}: weekly'.format('nation'), linewidth=2)
   ax.plot(nation_cases_weekly.index[1:],
           predicted_weekly_cases, label='pred', linewidth=2)
   ax.legend()

   ax = next(iter_axes); ax.cla()
   ax.plot(nation_cases_cumulative,
           label='{}: cumulative'.format('nation'), linewidth=2)
   ax.plot(nation_cases_cumulative.index,
           predicted_cumul_cases, label='pred', linewidth=2)
   ax.legend()
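If you want to sanity-check the optimizer's answer, one optional
approach (not required by the task) is to evaluate ``SIR_LOSS`` on a
coarse grid around the fitted values and confirm that the fitted pair
sits near the grid minimum. A minimal sketch, reusing the objects
defined in the cells above:

.. code-block:: python

   # optional sanity check: scan the loss on a coarse grid around the fit
   betas = np.linspace(0.5*beta_SIR, 1.5*beta_SIR, 41)
   gammas = np.linspace(0.5*gamma_SIR, 1.5*gamma_SIR, 41)
   loss_grid = np.array([[SIR_LOSS([b, g], IR_init, I_init,
                                   observed_cumul_cases)
                          for b in betas] for g in gammas])
   i, j = np.unravel_index(loss_grid.argmin(), loss_grid.shape)
   print('grid minimum near beta={:.3f}, gamma={:.3f}'.format(
       betas[j], gammas[i]))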
===============================================================
3. Refining the design by incorporating SafeGraph mobility data
===============================================================

The performance of the basic SIR design reveals the need to account for
the behavioral dynamics of a population dealing with a pandemic, in
order to make more reliable predictions. In this part, we refine our
initial design by incorporating mobility data provided by SafeGraph, in
order to capture changes in inter-state mobility across the US in the
covid era.

To save you time, we have already processed and trimmed daily US
mobility data starting from January 1st up until now. The end result,
accessible via ``interstate_mobility.pickle``, is a dictionary
providing weekly-averaged inter-state mobility data for the period of
January 1 - November 11 (a total of 46 weeks). Each item in the
dictionary is a pair ``(date, df)``, where:

- ``df.loc[st1, st2]`` counts the number of devices from state ``st1``
  that visited state ``st2``, averaged over the week starting from
  ``date``.
- the diagonal entries of ``df`` count the number of devices not
  leaving their home state, averaged over the same week.

.. jupyter-execute::

   with open('interstate_mobility.pickle', 'rb') as f_handle:
       interstate_mobility_weekly = pickle.load(f_handle)

   # let's look inside
   date = pd.Timestamp('2020-03-18')
   pd.options.display.float_format = '{:.2f}'.format
   display(interstate_mobility_weekly[date].iloc[:4, :6])

.. _task3.1:

.. topic:: Task 3.1

   Calculate and plot the average out-of-state travel by adding up the
   off-diagonal entries in each DataFrame and normalizing the result by
   the number of states. Are there any noticeable fluctuations in
   travel volume in response to covid? Summarize your observations from
   the plot (max. 5 lines).

.. jupyter-execute::

   avg_inflow = {}
   for date, df in interstate_mobility_weekly.items():
       # total off-diagonal travel, averaged over the states
       avg_inflow[date] = (np.sum(df.values) - np.trace(df))/df.shape[0]
   avg_inflow = pd.Series(avg_inflow)

   plt.plot(avg_inflow,
            label='average out of state travel', linewidth=4)
   plt.legend()

A networked SIR model with time-varying inter-state mobility
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This is an enhanced version of the networked SIS model discussed in the
lecture, capturing the effect of changes in inter-state mobility over
time. The population of any state :math:`j` at time :math:`t` is
composed of (i) those who have travelled to state :math:`j` from other
states, and (ii) those whose home state is :math:`j` and who have not
travelled elsewhere. We use :math:`n_{ij}(t)` to denote the number of
individuals with home state :math:`i` who are in state :math:`j` at
time :math:`t`. The infected and susceptible populations currently in
state :math:`j` are then :math:`\sum_{i=1}^{n}n_{ij}(t)I_i(t)` and
:math:`\sum_{i=1}^{n}n_{ij}(t)S_i(t)`. The normalized sizes become:

.. math::

   \begin{align}
   {\hat I}_j(t) &= \sum_{i=1}^{n}c_{ij}(t)I_i(t)\\
   {\hat S}_j(t) &= \sum_{i=1}^{n}c_{ij}(t)S_i(t),
   \end{align}

where :math:`c_{ij}(t):=\frac{n_{ij}(t)}{\sum_{i=1}^{n}n_{ij}(t)}`.
Observe that :math:`c_{ij}(t)` **normalizes the inflow** to state
:math:`j` to 1. In other words, the matrix :math:`C(t)=[c_{ij}(t)]` is
*column-stochastic*.
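As a concrete illustration of the column-stochastic normalization,
here is a short sketch that mirrors the construction used in the code
later on (it assumes ``interstate_mobility_weekly`` has been loaded as
above):

.. code-block:: python

   # build C(t) for one week by normalizing each column of the mobility
   # matrix, so that the total inflow into every state sums to 1
   df = interstate_mobility_weekly[pd.Timestamp('2020-03-18')]
   C = df.div(df.sum(axis=0), axis=1).values
   print(np.allclose(C.sum(axis=0), 1.0))   # True: columns sum to 1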
The size of new infections in state :math:`j` is
:math:`\beta_j {\hat I}_j(t){\hat S}_j(t)`. During the early stages, we
can use the first-order approximation :math:`\beta_j {\hat I}_j(t)` for
the size of new infections, out of which the share of travellers from
state :math:`i` is :math:`\beta_j {\hat I}_j(t)\, n_{ij}(t)` newly
infected individuals, who take the virus back with them to their home
state :math:`i`. The aggregate size of new cases among **residents of
state** :math:`{\bf i}` thus becomes
:math:`\sum_{j=1}^{n}\beta_j {\hat I}_j(t)\, n_{ij}(t)`, or
equivalently :math:`\sum_{j=1}^{n}\beta_j {\hat I}_j(t)\, \rho_{ij}(t)`
when normalized by their population size
:math:`\sum_{j=1}^{n} n_{ij}(t)`. Note that here
:math:`\rho_{ij}(t):=\frac{n_{ij}(t)}{\sum_{j=1}^{n}n_{ij}(t)}`. That
is, :math:`\rho_{ij}(t)` **normalizes the outflow** from state
:math:`i` to 1. In other words, the matrix
:math:`{\rm P}(t)=[\rho_{ij}(t)]` is *row-stochastic*.

The continuous-time approximate update rule for state :math:`i` then
becomes:

.. math::

   \begin{align*}
   \dot{S}_i(t) &= -\sum_{j=1}^{n}\rho_{ij}(t)\,\beta_j {\hat I}_j(t)\\
   \dot{I}_i(t) &= \sum_{j=1}^{n}\rho_{ij}(t)\,\beta_j {\hat I}_j(t) - \gamma I_i(t)
   \end{align*}

.. _task3.2:

.. topic:: Task 3.2

   - Verify that the dynamics described above can be written in the
     following compact form:

     .. math::

        \begin{align*}
        \dot{S}(t) &= -{\rm P}(t)~{\rm diag}(\beta)~C^T(t)~I(t)\\
        \dot{I}(t) &= \big({\rm P}(t)~{\rm diag}(\beta)~C^T(t)-\gamma~{\rm Id.}\big)~I(t)
        \end{align*}

     where :math:`{\rm P}(t)=[\rho_{ij}(t)]` and
     :math:`C(t)=[c_{ij}(t)]` are the row-stochastic and
     column-stochastic versions of the inter-state mobility matrix,
     respectively. (A numerical sanity check of this identity is
     sketched right after this task.)

   - Can you justify the use of different contact rates, but a common
     recovery rate, for the different states?
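An optional numerical sanity check of the compact form (this does not
replace the algebraic verification asked for in Task 3.2): the random
stochastic matrices, rates, and infection sizes below are made up
purely for the check.

.. code-block:: python

   # compare sum_j rho_ij * beta_j * Ihat_j with (P diag(beta) C^T I)_i
   rng = np.random.default_rng(0)
   n = 4
   M = rng.random((n, n))                      # raw mobility counts
   P_chk = M/M.sum(axis=1, keepdims=True)      # row-stochastic
   C_chk = M/M.sum(axis=0, keepdims=True)      # column-stochastic
   beta_chk, I_chk = rng.random(n), rng.random(n)

   I_hat = C_chk.T.dot(I_chk)                  # I_hat_j = sum_i c_ij I_i
   elementwise = np.array([sum(P_chk[i, j]*beta_chk[j]*I_hat[j]
                               for j in range(n)) for i in range(n)])
   compact = P_chk.dot(np.diag(beta_chk)).dot(C_chk.T).dot(I_chk)
   print(np.allclose(elementwise, compact))    # True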
The solution to the above continuous-time dynamics can be written as:

.. math::

   \begin{align*}
   I(t) &= e^{\int_0^{t}\big({\rm P}(\tau)~{\rm diag}(\beta)~C^T(\tau)-\gamma~{\rm Id.}\big)d\tau}~I(0)\\
   S(t) &= S(0) - \int_0^{t} {\rm P}(\tau)~{\rm diag}(\beta)~C^T(\tau)~I(\tau)~d\tau
   \end{align*}

where :math:`{\rm Id.}` stands for the *identity matrix*. We use the
following discrete-time approximation in our analysis:

.. math::

   \begin{align*}
   I(t+1) &= e^{\sum_{\tau=0}^{t}\big({\rm P}(\tau)~{\rm diag}(\beta)~C^T(\tau)-\gamma~{\rm Id.}\big)}~I(0)\\
   S(t+1) &= S(0) - \sum_{\tau=0}^{t} {\rm P}(\tau)~{\rm diag}(\beta)~C^T(\tau)~I(\tau)
   \end{align*}

The function ``Networked_SIR`` implemented below simulates :math:`T`
steps of our networked SIR model based on the above approximation.
Inputs to the ``Networked_SIR`` function:

- ``beta``: 1D array whose elements are the infection rates of the
  different states
- ``gamma``: the common recovery rate
- ``IR_init``: 1D array representing the initial size of the IR
  population in each state
- ``I_init``: 1D array representing the initial size of the infected
  population in each state
- ``C_matrices``: :math:`T\times n_{\rm states}\times n_{\rm states}`
  array. ``C_matrices[t]`` represents the inter-state mobility matrix
  with inflows normalized to 1 (column sum = 1) at time :math:`t`, for
  :math:`t=0,1,\ldots,T-1`
- ``P_matrices``: :math:`T\times n_{\rm states}\times n_{\rm states}`
  array. ``P_matrices[t]`` represents the inter-state mobility matrix
  with outflows normalized to 1 (row sum = 1) at time :math:`t`, for
  :math:`t=0,1,\ldots,T-1`

The number of time steps :math:`T` is inferred inside the function as
``len(C_matrices)``, and the model makes predictions for
:math:`t=1,\ldots,T`. Each step again represents a week (7D).

The function will then output:

- a tuple (``IR``, ``I``) of two :math:`T\times n_{\rm states}` arrays
  giving the model predictions for the size of the IR and I populations
  in each state for :math:`t=1,\ldots,T`

.. jupyter-execute::

   def Networked_SIR(beta, gamma, IR_init, I_init,
                     C_matrices, P_matrices):
       n_states = len(I_init)
       T = len(C_matrices)
       I = np.zeros((T+1, n_states))
       IR = np.zeros((T+1, n_states))
       I[0], IR[0] = I_init.copy(), IR_init.copy()
       for t in range(T):
           # P(t) diag(beta) C(t)^T
           P_beta_CT_t = P_matrices[t].dot(
               np.diag(beta).dot(C_matrices[t].T))
           # one-week matrix-exponential step for the infected sizes
           I[t+1] = np.dot(expm(
               P_beta_CT_t - gamma*np.eye(n_states)), I[t])
           IR[t+1] = IR[t] + np.dot(P_beta_CT_t, I[t])
       return IR[1:], I[1:]

Here, we search for
:math:`\beta = [\beta_1,\ldots,\beta_{n_{\rm states}}]` and
:math:`\gamma` minimizing the following loss:

.. math::

   {\underset{\beta,\,\gamma}{\rm minimize}} \sum_{t=1}^T \sum_{i=1}^{n_{\rm states}}\frac{\big(\text{cumulative cases from data}(i,t) - \text{predicted cumulative cases}(i,t)\big)^2}{n_{\rm states}\times T}

This is the average of the losses of the individual states' predictions
for the **raw number of cases**, not the normalized counts (can you
guess why we do not use normalized counts here?). Here is the code
implementing this loss function:

.. jupyter-execute::

   def Networked_SIR_LOSS(params, IR_init, I_init,
                          C_matrices, P_matrices, observed_cumul_cases):
       beta, gamma = params[:-1], params[-1]
       n_states = len(I_init)
       T = len(P_matrices)
       predicted_IR, predicted_I = Networked_SIR(beta, gamma,
           IR_init, I_init, C_matrices, P_matrices)
       # convert the normalized counts back to raw counts
       error_in_cumul_cases = (observed_cumul_cases - predicted_IR)\
           *state_geo['population'].values
       # normalize the error by the number of states and T.
       # The division by 1000 is just to keep the error from blowing up!
       loss = norm(error_in_cumul_cases/(n_states*T*1000))**2
       return loss

Inputs and outputs of the ``Networked_SIR_LOSS`` function:

- ``params``: 1D array passing the model parameters :math:`\beta` and
  :math:`\gamma` to the loss function
- ``IR_init``: the initial size of the infected-or-recovered cases for
  each state
- ``I_init``: the initial size of the infected cases for each state
- ``C_matrices``: :math:`T\times n_{\rm states}\times n_{\rm states}`
  array. ``C_matrices[t]`` represents the inter-state mobility matrix
  with inflows normalized to 1 (column sum = 1) at time :math:`t`, for
  :math:`t=0,1,\ldots,T-1`
- ``P_matrices``: :math:`T\times n_{\rm states}\times n_{\rm states}`
  array. ``P_matrices[t]`` represents the inter-state mobility matrix
  with outflows normalized to 1 (row sum = 1) at time :math:`t`, for
  :math:`t=0,1,\ldots,T-1`
- ``observed_cumul_cases``: :math:`T\times n_{\rm states}` array
  passing the observed cumulative count of covid cases in each state
  for :math:`t=1,\ldots,T`. **Note that we still pass the normalized
  values of observed_cumul_cases to the loss function; the conversion
  to raw numbers is implemented inside the function.**

The function will then output:

- ``loss``: the sum of squares of the prediction errors in cumulative
  case counts at the state level, averaged over all the states
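Before launching the full optimization in Task 3.3 below (which has
:math:`n_{\rm states}+1` free parameters and can take a while), a cheap
and optional sanity check is to evaluate the loss once at the initial
guess and confirm that the array shapes line up. A sketch, assuming
``C_matrices``, ``P_matrices``, ``IR_init``, ``I_init``, ``n_train``,
and ``params_init`` have already been built as in the code cells that
follow:

.. code-block:: python

   # optional pre-flight check: one loss evaluation at the initial guess
   loss0 = Networked_SIR_LOSS(
       params_init, IR_init, I_init,
       C_matrices[:n_train], P_matrices[:n_train],
       state_cases_cumulative[1:1+n_train].values)
   print('loss at initial guess: {:.4f}'.format(loss0))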
.. topic:: Task 3.3

   - Use the function ``minimize`` from the *scipy.optimize* package to
     find values of :math:`\beta=[\beta_1,\ldots,\beta_{n}]` and
     :math:`\gamma` minimizing the loss function ``Networked_SIR_LOSS``
     over a period of :math:`T=16` weeks, from March 25 to July 08.
     (Choose data from ``state_cases_cumulative`` for this period to
     pass as ``observed_cumul_cases`` to the loss function.)
   - Make a bar plot of the :math:`\beta` values versus their state ids
     (use ``state_geo.index`` for the state ids).
   - Pass the values obtained for :math:`\beta` and :math:`\gamma` to
     the ``Networked_SIR`` function and use its output to predict the
     cumulative number of covid cases in the US for 36 weeks starting
     from March 18.
   - Plot the predictions together with the observed data for both
     (i) the cumulative number of cases and (ii) weekly new cases
     (use ``np.diff()`` to obtain the latter).

.. jupyter-execute::

   t_init = state_cases_cumulative.index[0]
   datelist = avg_inflow[t_init:].index

   # column-stochastic (inflow-normalized) mobility matrices
   C_matrices = np.array([interstate_mobility_weekly[date].div(
       interstate_mobility_weekly[date].sum(axis=0), axis=1).values
       for date in datelist])
   # row-stochastic (outflow-normalized) mobility matrices
   P_matrices = np.array([interstate_mobility_weekly[date].div(
       interstate_mobility_weekly[date].sum(axis=1), axis=0).values
       for date in datelist])

   IR_init = I_init = state_cases_cumulative.head(1).values[0]
   n_states = len(I_init)
   beta, gamma = beta_SIR*np.ones(n_states), gamma_SIR
   params_init = np.append(beta, gamma)

.. jupyter-execute::

   start_time = timer()
   res = minimize(Networked_SIR_LOSS, params_init,
                  args=(IR_init, I_init,
                        C_matrices[:n_train], P_matrices[:n_train],
                        state_cases_cumulative[1:1+n_train].values))
   elapsed = timer() - start_time
   print('elapsed time: {:.2f} seconds, success: {}'.format(
       elapsed, res['success']))
   beta, gamma = res['x'][:-1], res['x'][-1]

.. jupyter-execute::

   plt.figure(1, figsize=(18, 6))
   plt.bar(state_geo.index, beta, label='beta')
   plt.legend()

.. jupyter-execute::

   predicted_IR, predicted_I = Networked_SIR(beta, gamma,
                                             IR_init, I_init,
                                             C_matrices, P_matrices)
   population_weights = state_geo['population'].div(
       state_geo['population'].sum())
   US_predicted_IR = (predicted_IR*population_weights.values).sum(axis=1)
   predicted_weekly_cases = np.diff(
       np.concatenate([IR_init.reshape(1, -1), predicted_IR]), axis=0)
   US_predicted_weekly_cases = \
       (predicted_weekly_cases*population_weights.values).sum(axis=1)
   n_preds = len(US_predicted_weekly_cases)

   fig, axes = plt.subplots(2, 1)
   iter_axes = itertools.cycle(fig.axes)

   ax = next(iter_axes)
   ax.cla()
   ax.plot(nation_cases_weekly[:n_preds+1],
           label='US weekly cases', linewidth=4)
   ax.plot(nation_cases_weekly.index[1:n_preds+1],
           US_predicted_weekly_cases, label='predicted', linewidth=2)
   ax.legend()

   ax = next(iter_axes)
   ax.cla()
   ax.plot(nation_cases_cumulative[:n_preds+1],
           label='US cumulative cases', linewidth=4)
   ax.plot(nation_cases_cumulative.index[1:n_preds+1],
           US_predicted_IR, label='predicted', linewidth=2)
   ax.legend()

Model-based prediction of the efficacy of local state regulations and social practices
---------------------------------------------------------------------------------------

Local and state-level policies and regulations, such as stay-at-home
orders and working from home, as well as social practices such as
social distancing and public mask-wearing, can significantly affect the
local infection/contact rates. As the last step of this project, we
will learn how to use our model to estimate the impact of improving
infection/contact rates on the size of the infection.

.. topic:: Task 4.1

   Suppose that by means of local regulations and promoting social
   awareness you could reduce the infection rates from
   :math:`\beta=[\beta_1,\ldots,\beta_n]` to :math:`c\,\beta` for some
   :math:`0<c<1`. Use the fitted model to predict how such a reduction
   would affect the size of the infection.
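As a starting point for Task 4.1, here is a minimal sketch of the kind
of experiment the task calls for: rerunning ``Networked_SIR`` with the
fitted parameters but with :math:`\beta` scaled down by a factor
:math:`c`. The particular values of ``c`` below are illustrative
assumptions, and the sketch reuses the fitted ``beta``, ``gamma``, and
the mobility matrices from the cells above:

.. code-block:: python

   # sketch: predicted US cumulative cases under reduced contact rates
   for c in [1.0, 0.8, 0.6]:        # c = 1.0 reproduces the fitted model
       IR_c, _ = Networked_SIR(c*beta, gamma, IR_init, I_init,
                               C_matrices, P_matrices)
       US_IR_c = (IR_c*population_weights.values).sum(axis=1)
       plt.plot(nation_cases_cumulative.index[1:len(US_IR_c)+1],
                US_IR_c, label='c = {:.1f}'.format(c), linewidth=2)
   plt.legend()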