data scraping, statistical modelling,

Coronavirus Data Analysis

Penny Penny Mar 14, 2020 · 6 mins read
Coronavirus Data Analysis

Subtitle: COVID19 Spread SIR Model Simulation using Python

Since the President declared a national emergency concerning the coronavirus disease, the world has been thrust into an unprecedented era marked by widespread uncertainty and profound challenges. The global response to the COVID-19 pandemic has emphasized the critical need for predictive models to understand and mitigate the spread of infectious diseases.

The Susceptible-Infected-Recovered (SIR) model is commonly used in epidemiology to understand and predict the spread of infectious diseases within a population.

In this context, the Susceptible-Infected-Recovered (SIR) model has emerged as a valuable tool in epidemiological research, offering insights into the dynamics of contagion within populations.

Since the President declared a national emergency concerning the coronavirus disease, several months have been passed since the start of quarantine. The spread of COVID 19 seems going worse as more and more colleagues and friends are getting sick.

In this time huge changes have occurred since I first started this blog in early February. My initial goal was to create a source that would collect total number of cases worldwide since the most updated information was not easily available to the public. However, the current situation has now changed, and there are more reliable resources that are documenting the spread of COVID-19 and disseminating the research in clear graphs. However, if you like to try my code, I have presented it below (output is in json format).

def predict(c):
	post_url = "https://covid19.who.int/page-data/region/wpro/country/cn/page-data.json"
	try:
		res = re.get(post_url, timeout=10)
		status_code = res.status_code
	except re.exceptions.ConnectionError:
		status_code = 'CONNECTION ERROR'
	if status_code != 200:
		print ('___'*20, status_code)
		return 'STATUS ERROR'
	
	json_data = json.loads(res.json()['result']['data']['countryGroup']['data'])
	cols = ['date','_']+[x['name'] for x in json_data['metrics']]
	df = pd.DataFrame(data=json_data['rows'], columns=cols)
	df['date'] = pd.to_datetime(df['date'],utc=True,unit='ms').dt.date
	df.drop(columns=['_'],inplace=True)	
#	print (df.describe().T) 
#	print (df.loc[df['Deaths']<0]) # abnormal data neg daily death 
	df = df.loc[df['Deaths']>0]
	print (df)
 
predict(c='china')

Although the situation with the coronavirus seems to be under control in China, the number of individuals infected in the US is rapidly increasing. As of March 31st 2020, the confirmed cases in the US reached around 188,530 and some officials are predicting that the number of deaths may reach up to 240,000.

corona_tot

Figure 1 (a). Total confirmed cases in US, China, Italy (b). Log scale

In Figure 1, we can observe the progression of the coronavirus outbreak in three countries, each in a different phase. China has managed to control the spread of the virus, with the number of cases stabilizing after day 30. Italy’s rate of infection has slowed down and could potentially follow China’s trend (as indicated by the green line). In the US, we are still in the midst of a growing phase, but there is a glimmer of hope as further calculations suggest a possible slowdown (as indicated by the tip of the blue line). Figure 1b, which uses a log scale, demonstrates that the US and China have a higher infection rate due to their high population density. The slope of the graph in Figure 1b is calculated every 14 days using linear regression, and is obtained by breaking the data into smaller sections and taking the maximum slope.

Analysis

Figure 1(a) shows that US data (blue line) probably follows an exponential trend ain the early stage, as also shown in the log Figure 1(b). Whereas, the green curve (China) looks more like a sigmoid function. So let’s first try to predict the infection population I(t) with a simple exponential model, defining I as infection population with an assumption that each patient infects A number of new people every day. This can be tested by by plotting log(y) against t - Figure 1(b):

\[\begin{equation} I = I_{0}\times A^{t} \end{equation}\]

where t is number of days since \(I_{0}\) cases.

This model only fits the blue line, and does not fit the trend of the yellow or green line at later stage. After doing some research on spread of infectious diseases, it seems an SIR-model might be a better model. The model consists of three parts: infected population, susceptible population and recovered/immune population. As I am trying to model the total of confirmed cases, not the active cases, for simplicity I will I can drop the recovery aspect of the model. This greatly simplifies the math, allowing to simply solve an ordinary differential equation. To further simplify this, let’s define population as 1 and I(t) is the percentage of population that is infected and S(t) is the proportion of susceptible population. Then we get:

\[\begin{equation} S(t)+I(t) = 1 \end{equation}\]

Define \(\beta\) as transmission rate. Upon further reflection, the rate of infection should be affected by both infection and susceptible population. To consider two extreme cases,

  • when s = 1 , i = 0, the transmission rate, β = 0
  • when s = 0 , i = 1, everyone is infected and no more new cases, β=0

If we divide both side by \(\delta t\), we can get:

\[\frac{dI}{dt} = - \frac{dS}{dt} = \beta S I\]

This is equation is an ODE function, and we can solve it using scipy package.

In fact, turns out that this function \(\begin{equation} \frac{dI}{dt} = \beta I \cdot (1-I) \end{equation}\)

is the derivate of sigmoid function

\[\begin{equation} I = \frac{1}{1+e^{-t}} \end{equation}\]

Prediction using sigmoid function

The model predicts the total confirmed cases next 30 days. For example, given data on March 1st, the model estimates about 172172 confirmed cases in Italy at the end of April.

Thanks for reading !

Penny
Written by Penny
Data scienctist, love to explore new ideas