BeABee was an art and science project conducted by a group in Switzerland. Its aim was to identify new patterns in bee sound data and possibly map the measurements with labels regarding the state of the bee population. The data was also used for an auditive art installation.
BeABee made their data available to us and we performed a Fast Fourier Transform (FFT) on the sound files. In this document, we present some exploratory results. Mainly, we investigate how the frequency of the sounds change over time and how it correlates with the weather data.
BeABee installed 6 microphones:
Additionally, they used piezo contact microphones, which have not been considered in our analysis yet.
The FFT was performed on 10-minute bins. For the calculation of the mean, median and dominant frequency we did not employ any filter and the high frequency resolution is remained. We saved the mean of 20 Hz frequency bins in a range between 0 and 1000 Hz. The code can be found in fft/.
For a first analysis, we used some exploratory methods, a multiple linear regression model and a linear regression with a correlation structure (Generalised Least Squares). So far, we focus on only one microphone, the microphone on the left side in the bottom of the hive.
The weather data can be accessed under http://beabee.ch/forschungsdaten.html and in this repository under meteoNew.csv.
The original audio data has a size of approximately 2.8 TB and is therefore not published.
The output of the FFT comes in aggregated form. It contains the mean for 20 Hz frequency bins (columns X0 - X980), the mean frequency (mean), its standard deviation (sd) and standard error (sem), the median frequency (median), the dominant frequency (mode), the channel, a timestamp and the filename.
Here are the first 3 entries for the highest frequency bins and the further columns:
df <- read.table("aggregatedBeABeeAudio.csv", header = T, sep=",")
library(knitr)
kable(head(df[,49:58], n = 3))
X960 | X980 | mean | sd | sem | median | mode | channel | timestamp | file |
---|---|---|---|---|---|---|---|---|---|
0.0033774 | 0.0023070 | 3710.801 | 5224.228 | 1.436290 | 1246.955 | 50.03667 | 2 | 2013-05-01 00:30:00 | mik innen-aussen-01-may-00-30.wav |
0.0035939 | 0.0035109 | 3244.735 | 4944.609 | 1.359415 | 1112.138 | 50.02500 | 2 | 2013-05-01 00:40:00 | mik innen-aussen-01-may-00-30.wav |
0.0052965 | 0.0072284 | 3535.596 | 5200.648 | 1.429807 | 1135.558 | 50.01834 | 2 | 2013-05-01 00:50:00 | mik innen-aussen-01-may-00-30.wav |
We create a script to disaggregate the data and to merge it with the weather data. It is hidden behind the “code” button.
## This is a script to read the aggregated beabee-audio data and to disaggregate it
#### read aggregated data
df <- read.table("aggregatedBeABeeAudio.csv", header = T, sep=",")
# create a type column, identifying the mice
df$type <- substr(df$file,1,11)
unique(df$type)
##### disaggregate. You can choose an other column than the mean frequency
library(reshape)
disagg <- cast(df, timestamp~type+channel, value="mean")
nrow(disagg)
names(disagg)
#### combine with weather data
weather <- read.table("meteoNew.csv", header = T, sep=",")
#### get the timestamp in the correct format
# for disagg, at midnight "00:00:00" is missing
disagg$timestamp <- as.character(disagg$timestamp)
disagg$timestamp <- ifelse(nchar(disagg$timestamp) <= 10,yes = paste(disagg$timestamp, "00:00:00"), no = disagg$timestamp)
# for weather, it sometimes has the form 7:00:00 instead of 07:00:00 (as example)
weather$timestamp <- paste(weather$Date, weather$Time)
weather$timestamp <- as.character(as.POSIXlt(weather$timestamp, format="%m/%d/%Y %H:%M:%S"))
#### merge data
merged <- merge(disagg, weather, by="timestamp", all = T)
# controll which timestamps are missing
head(disagg[!(disagg$timestamp %in% weather$timestamp),], n = 50)
missingweather <- merged[is.na(merged$Rain),]
nrow(missingweather)
nrow(merged)
# save the data
write.csv(merged, file= "mergedmean.csv", sep=",", quote = F, row.names = F)
The most important line of the script is
disagg <- cast(df, timestamp~type+channel, value="mean")
Here the data gets disaggregated, with “mean” being the only measurement retained for all microphones.
The script results in data in the following format:
require(lattice)
require(ggplot2)
source("cor.R")
merged <- read.csv("mergedmean.csv")
library(knitr)
kable(head(merged[15:18,], n = 3))
timestamp | mik.innen.a_1 | mik.innen.a_2 | miks.boden._1 | miks.boden._2 | miks.kanal._1 | miks.kanal._2 | Date | Time | HumidityHive | TemperatureHive | WatervapourdenHive | TemperatureOutside | Barometer | HumidyOutside | Rain | WindSpeed | WindDir | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
15 | 2013-04-13 11:00:00 | NA | NA | 1327.220 | 1394.588 | NA | NA | 4/13/2013 | 11:00:00 | 100.00 | 10.14 | 9.471 | 9.05556 | 1022.35 | 76 | 0 | 3.22 | 202 |
16 | 2013-04-13 11:10:00 | NA | NA | 1929.673 | 2068.534 | NA | NA | 4/13/2013 | 11:10:00 | 79.99 | 22.38 | 15.858 | 9.27778 | 1022.46 | 75 | 0 | 1.61 | 180 |
17 | 2013-04-13 11:20:00 | NA | NA | 1383.662 | 1537.951 | NA | NA | 4/13/2013 | 11:20:00 | 59.75 | 24.23 | 13.163 | 10.05560 | 1022.52 | 75 | 0 | 0.00 | 225 |
where mik.innen.a_1 is within the hive, positioned at the center of the floor, mik.innen.a_2 is outside of the hive miks.boden._1 is at the bottom-left side, within the hive miks.boden._2 is at the bottom-right side, within the hive miks.kanal._1 is on the right side of the entree canal miks.kanal._2 is on the left side of the entree canal.
We decided to remove temperature measurements under 27.5 °C, as we suspect they only occured when the sensor was just installed.
# remove value
#nonn <- merged[!is.na(merged$mik.innen.a_1),]
#merged <- merged[-3789,]
# nothing can be heard in the audio, let's remove it
#hist(merged$TemperatureHive)
merged <- merged[-which(merged$TemperatureHive<27.5),]
Let us just look at some of the relationships between measurements. We find many statistically significant relationships (significance level indicated by stars, 3 stars is the best).
pairs(merged[, 10:12], lower.panel=panel.smooth, upper.panel=panel.cor)
pairs(merged[, c(2, 4, 13, 15)], lower.panel=panel.smooth, upper.panel=panel.cor)
Especcialy, let us look at the relationships between our dependent and the independent variables. We converted Time and Date into numerical variables by matching each date/ time with an index. For some covariates is hard to say, if a linear relationship is justified, for example wind speed, which is not measured continously either. For others, like temperature outside, there is quite an obvious relation. Looking at rain, it might be a good idea to transform this to a factor variable with a “rain” and a “no rain” option.
coefs <- 10:20
ab <- 4
merged$DateNum <- as.numeric(merged$Date)
merged$TimeNum <- as.numeric(merged$Time)
for(i in 1:length(coefs)){
plot(merged[,coefs[i]], merged[,ab], xlab=names(merged)[coefs[i]],
ylab=names(merged)[ab])
}
For some of the covariates it is hard to say, if there is a linear relationship. A Multiple Linear Regression therefore might no be the best choice and results are to be handled if caution, but it can give us as a good first impression.
We start with a model considering all covariates.
audio <- lm(miks.boden._1 ~ DateNum+ TimeNum+ HumidityHive+ TemperatureHive+ WatervapourdenHive+ TemperatureOutside+ Barometer+ HumidyOutside+ Rain+ WindSpeed+ WindDir , data=merged)
summary(audio)
##
## Call:
## lm(formula = miks.boden._1 ~ DateNum + TimeNum + HumidityHive +
## TemperatureHive + WatervapourdenHive + TemperatureOutside +
## Barometer + HumidyOutside + Rain + WindSpeed + WindDir, data = merged)
##
## Residuals:
## Min 1Q Median 3Q Max
## -565.91 -88.77 -4.48 73.64 550.47
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.496e+03 1.305e+03 -4.213 2.61e-05 ***
## DateNum -7.256e+00 2.176e-01 -33.346 < 2e-16 ***
## TimeNum 3.645e-01 7.673e-02 4.750 2.16e-06 ***
## HumidityHive 1.576e+00 8.841e+00 0.178 0.8586
## TemperatureHive -3.707e+01 3.280e+01 -1.130 0.2585
## WatervapourdenHive 1.902e+00 2.548e+01 0.075 0.9405
## TemperatureOutside 1.461e+01 1.326e+00 11.014 < 2e-16 ***
## Barometer 7.766e+00 7.088e-01 10.958 < 2e-16 ***
## HumidyOutside 1.041e+00 4.099e-01 2.540 0.0111 *
## Rain 1.191e+02 3.748e+01 3.178 0.0015 **
## WindSpeed 2.526e+00 9.856e-01 2.562 0.0105 *
## WindDir 3.793e-02 2.634e-02 1.440 0.1500
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 141.1 on 2319 degrees of freedom
## (7590 observations deleted due to missingness)
## Multiple R-squared: 0.6086, Adjusted R-squared: 0.6067
## F-statistic: 327.8 on 11 and 2319 DF, p-value: < 2.2e-16
Before we use this modell, we should check for collinearity between covariates. Collinearity makes the model unstable and the standard errors for the estimates might be very high. We can detect collinearity using variance inflation factors (VIFs):
library(car)
## Loading required package: carData
vif(audio)
## DateNum TimeNum HumidityHive
## 1.652567 1.196500 517.771899
## TemperatureHive WatervapourdenHive TemperatureOutside
## 89.249233 515.791478 4.478465
## Barometer HumidyOutside Rain
## 1.609649 4.865273 1.116961
## WindSpeed WindDir
## 1.216154 1.124947
There is no clear rule which tells us, if a VIF is good or to high. Usually, a VIF should not be higher than 5 or 10. We therefore decided to remove WatervapourdenHive from our model:
audioVif <- lm(miks.boden._1 ~ DateNum+ TimeNum+ HumidityHive+ TemperatureHive+ TemperatureOutside+ Barometer+ HumidyOutside+ Rain+ WindSpeed+ WindDir, data=merged)
summary(audioVif)
##
## Call:
## lm(formula = miks.boden._1 ~ DateNum + TimeNum + HumidityHive +
## TemperatureHive + TemperatureOutside + Barometer + HumidyOutside +
## Rain + WindSpeed + WindDir, data = merged)
##
## Residuals:
## Min 1Q Median 3Q Max
## -565.97 -88.76 -4.52 73.43 550.19
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.577e+03 7.312e+02 -7.628 3.47e-14 ***
## DateNum -7.254e+00 2.148e-01 -33.767 < 2e-16 ***
## TimeNum 3.646e-01 7.668e-02 4.755 2.11e-06 ***
## HumidityHive 2.235e+00 4.482e-01 4.986 6.61e-07 ***
## TemperatureHive -3.464e+01 4.062e+00 -8.527 < 2e-16 ***
## TemperatureOutside 1.461e+01 1.326e+00 11.016 < 2e-16 ***
## Barometer 7.768e+00 7.082e-01 10.968 < 2e-16 ***
## HumidyOutside 1.040e+00 4.091e-01 2.541 0.01113 *
## Rain 1.189e+02 3.733e+01 3.185 0.00147 **
## WindSpeed 2.528e+00 9.846e-01 2.568 0.01029 *
## WindDir 3.786e-02 2.631e-02 1.439 0.15036
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 141.1 on 2320 degrees of freedom
## (7590 observations deleted due to missingness)
## Multiple R-squared: 0.6086, Adjusted R-squared: 0.6069
## F-statistic: 360.7 on 10 and 2320 DF, p-value: < 2.2e-16
We can look at the VIFs again:
vif(audioVif)
## DateNum TimeNum HumidityHive
## 1.611198 1.195549 1.331180
## TemperatureHive TemperatureOutside Barometer
## 1.369814 4.478452 1.607939
## HumidyOutside Rain WindSpeed
## 4.849861 1.108363 1.214228
## WindDir
## 1.123512
Now, the VIF values are looking good.
Looking at AIC values, it does not matter if we keep WindDirection in the model. All other covariates seem to be significant! This can be seen from the summary output above.
step.linearAllAIC<- step(audioVif, direction="both")
## Start: AIC=23084.36
## miks.boden._1 ~ DateNum + TimeNum + HumidityHive + TemperatureHive +
## TemperatureOutside + Barometer + HumidyOutside + Rain + WindSpeed +
## WindDir
##
## Df Sum of Sq RSS AIC
## <none> 46168844 23084
## - WindDir 1 41193 46210037 23084
## - HumidyOutside 1 128468 46297313 23089
## - WindSpeed 1 131237 46300082 23089
## - Rain 1 201836 46370681 23092
## - TimeNum 1 449961 46618805 23105
## - HumidityHive 1 494790 46663634 23107
## - TemperatureHive 1 1446988 47615832 23154
## - Barometer 1 2394160 48563005 23200
## - TemperatureOutside 1 2415164 48584008 23201
## - DateNum 1 22690696 68859541 24014
audioVifStep <- lm(miks.boden._1 ~ DateNum + TimeNum + HumidityHive + TemperatureHive +
TemperatureOutside + Barometer + HumidyOutside + Rain + WindSpeed +
WindDir, data=merged)
summary(audioVifStep)
##
## Call:
## lm(formula = miks.boden._1 ~ DateNum + TimeNum + HumidityHive +
## TemperatureHive + TemperatureOutside + Barometer + HumidyOutside +
## Rain + WindSpeed + WindDir, data = merged)
##
## Residuals:
## Min 1Q Median 3Q Max
## -565.97 -88.76 -4.52 73.43 550.19
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.577e+03 7.312e+02 -7.628 3.47e-14 ***
## DateNum -7.254e+00 2.148e-01 -33.767 < 2e-16 ***
## TimeNum 3.646e-01 7.668e-02 4.755 2.11e-06 ***
## HumidityHive 2.235e+00 4.482e-01 4.986 6.61e-07 ***
## TemperatureHive -3.464e+01 4.062e+00 -8.527 < 2e-16 ***
## TemperatureOutside 1.461e+01 1.326e+00 11.016 < 2e-16 ***
## Barometer 7.768e+00 7.082e-01 10.968 < 2e-16 ***
## HumidyOutside 1.040e+00 4.091e-01 2.541 0.01113 *
## Rain 1.189e+02 3.733e+01 3.185 0.00147 **
## WindSpeed 2.528e+00 9.846e-01 2.568 0.01029 *
## WindDir 3.786e-02 2.631e-02 1.439 0.15036
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 141.1 on 2320 degrees of freedom
## (7590 observations deleted due to missingness)
## Multiple R-squared: 0.6086, Adjusted R-squared: 0.6069
## F-statistic: 360.7 on 10 and 2320 DF, p-value: < 2.2e-16
It would be reasonable to have some interaction between Date and Time, as the timepoint of the sunrise changes every day. However, as long as Date is numeric, we do not find a significant relationship:
# interaction not significant:
inter <- lm(miks.boden._1 ~ DateNum + TimeNum + HumidityHive + TemperatureHive +
TemperatureOutside + Barometer + HumidyOutside + Rain + WindSpeed +
WindDir+ TimeNum:DateNum, data=merged)
summary(inter)
##
## Call:
## lm(formula = miks.boden._1 ~ DateNum + TimeNum + HumidityHive +
## TemperatureHive + TemperatureOutside + Barometer + HumidyOutside +
## Rain + WindSpeed + WindDir + TimeNum:DateNum, data = merged)
##
## Residuals:
## Min 1Q Median 3Q Max
## -560.62 -90.10 -3.93 74.29 550.68
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.536e+03 7.314e+02 -7.569 5.38e-14 ***
## DateNum -6.778e+00 3.714e-01 -18.247 < 2e-16 ***
## TimeNum 5.874e-01 1.612e-01 3.644 0.000275 ***
## HumidityHive 2.222e+00 4.481e-01 4.959 7.59e-07 ***
## TemperatureHive -3.520e+01 4.077e+00 -8.635 < 2e-16 ***
## TemperatureOutside 1.473e+01 1.328e+00 11.094 < 2e-16 ***
## Barometer 7.729e+00 7.084e-01 10.910 < 2e-16 ***
## HumidyOutside 1.042e+00 4.090e-01 2.548 0.010901 *
## Rain 1.177e+02 3.733e+01 3.152 0.001642 **
## WindSpeed 2.470e+00 9.850e-01 2.507 0.012238 *
## WindDir 3.970e-02 2.633e-02 1.508 0.131732
## DateNum:TimeNum -6.600e-03 4.202e-03 -1.571 0.116356
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 141 on 2319 degrees of freedom
## (7590 observations deleted due to missingness)
## Multiple R-squared: 0.609, Adjusted R-squared: 0.6071
## F-statistic: 328.4 on 11 and 2319 DF, p-value: < 2.2e-16
Using Date as factor results in a significant relationship. Nevertheless, this is not very helpful for generalisations. There seem to be other factors then the duration of sunlight, which are not part of our data.
# but with factor highly significant:
interfac <- lm(miks.boden._1 ~ DateNum + TimeNum + HumidityHive + TemperatureHive +
TemperatureOutside + Barometer + HumidyOutside + Rain + WindSpeed +
WindDir+ TimeNum:Date, data=merged)
summary(interfac)
##
## Call:
## lm(formula = miks.boden._1 ~ DateNum + TimeNum + HumidityHive +
## TemperatureHive + TemperatureOutside + Barometer + HumidyOutside +
## Rain + WindSpeed + WindDir + TimeNum:Date, data = merged)
##
## Residuals:
## Min 1Q Median 3Q Max
## -520.80 -76.52 -0.05 72.94 554.47
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.622e+03 1.087e+03 -1.493 0.135698
## DateNum -7.041e+00 3.473e-01 -20.272 < 2e-16 ***
## TimeNum 1.844e+00 3.645e-01 5.060 4.52e-07 ***
## HumidityHive 1.464e+00 4.843e-01 3.022 0.002537 **
## TemperatureHive -4.668e+01 4.346e+00 -10.741 < 2e-16 ***
## TemperatureOutside 1.329e+01 1.816e+00 7.318 3.45e-13 ***
## Barometer 4.209e+00 1.052e+00 4.002 6.49e-05 ***
## HumidyOutside 2.463e+00 5.237e-01 4.704 2.71e-06 ***
## Rain 1.567e+01 3.537e+01 0.443 0.657766
## WindSpeed 3.672e+00 9.536e-01 3.850 0.000121 ***
## WindDir 1.054e-01 2.505e-02 4.207 2.69e-05 ***
## TimeNum:Date4/14/2013 -7.047e-01 3.531e-01 -1.996 0.046048 *
## TimeNum:Date4/15/2013 -2.157e-01 3.685e-01 -0.585 0.558271
## TimeNum:Date4/16/2013 -5.487e-01 3.946e-01 -1.390 0.164545
## TimeNum:Date4/17/2013 -3.059e-01 4.097e-01 -0.747 0.455353
## TimeNum:Date4/18/2013 -6.684e-02 4.176e-01 -0.160 0.872852
## TimeNum:Date4/27/2013 -3.074e+00 4.789e-01 -6.418 1.67e-10 ***
## TimeNum:Date4/28/2013 -2.324e+00 3.673e-01 -6.329 2.96e-10 ***
## TimeNum:Date4/29/2013 -2.652e+00 3.720e-01 -7.130 1.34e-12 ***
## TimeNum:Date4/30/2013 -2.689e+00 3.769e-01 -7.134 1.30e-12 ***
## TimeNum:Date5/1/2013 -2.636e+00 3.885e-01 -6.785 1.47e-11 ***
## TimeNum:Date5/17/2013 -3.592e+00 6.807e-01 -5.277 1.44e-07 ***
## TimeNum:Date5/18/2013 -1.814e+00 5.276e-01 -3.439 0.000595 ***
## TimeNum:Date5/22/2013 -1.842e+00 5.665e-01 -3.251 0.001168 **
## TimeNum:Date5/23/2013 -2.061e-01 4.249e-01 -0.485 0.627738
## TimeNum:Date5/24/2013 -2.719e+00 4.187e-01 -6.495 1.02e-10 ***
## TimeNum:Date5/25/2013 -2.287e+00 4.370e-01 -5.234 1.81e-07 ***
## TimeNum:Date5/26/2013 -3.217e+00 4.326e-01 -7.437 1.45e-13 ***
## TimeNum:Date5/27/2013 -1.851e+00 4.540e-01 -4.076 4.74e-05 ***
## TimeNum:Date5/28/2013 -3.692e+00 7.901e-01 -4.673 3.13e-06 ***
## TimeNum:Date5/29/2013 -2.207e+00 4.377e-01 -5.043 4.94e-07 ***
## TimeNum:Date5/30/2013 -2.976e+00 4.184e-01 -7.112 1.52e-12 ***
## TimeNum:Date5/31/2013 -9.381e-01 4.160e-01 -2.255 0.024232 *
## TimeNum:Date5/3/2013 -2.602e+00 4.917e-01 -5.291 1.33e-07 ***
## TimeNum:Date5/4/2013 -1.093e+00 3.894e-01 -2.807 0.005045 **
## TimeNum:Date5/5/2013 -8.409e-01 3.937e-01 -2.136 0.032805 *
## TimeNum:Date5/6/2013 -1.108e+00 4.032e-01 -2.748 0.006046 **
## TimeNum:Date5/7/2013 -1.252e+00 4.111e-01 -3.045 0.002356 **
## TimeNum:Date5/8/2013 -9.880e-01 4.015e-01 -2.461 0.013932 *
## TimeNum:Date5/9/2013 -1.137e+00 4.302e-01 -2.642 0.008291 **
## TimeNum:Date6/1/2013 -2.617e+00 4.446e-01 -5.885 4.56e-09 ***
## TimeNum:Date6/2/2013 -2.919e+00 4.393e-01 -6.645 3.79e-11 ***
## TimeNum:Date6/3/2013 -1.539e+00 4.263e-01 -3.612 0.000311 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 124.6 on 2288 degrees of freedom
## (7590 observations deleted due to missingness)
## Multiple R-squared: 0.6988, Adjusted R-squared: 0.6933
## F-statistic: 126.4 on 42 and 2288 DF, p-value: < 2.2e-16
When performing a linear regression, we indirectly make some assumptions about the modelled data. * We assume that each relationship between one of the covariates and the dependent variable is linear. * We assume that the variance of the model errors is constant. * We assume that errors are independent. * We assume that the errors are normally distributed.
This assumption seems to be fulfilled. It can be accessed via partial residual plots. The relationship should be indicated by a straight line.
# Linearity
#termplot(audioVifStep,se=T)
termplot(audioVifStep,se=T,partial.resid=T)
This can be checked with the Breusch-Pagan test. The null hypothesis is, that the error variance is constant, while the alternative hypothesis is, that the variance varies with the fitted values/ a linear combination of the predictors. If the p-value is higher than 0.05, we do not reject the null hypothesis. So, there seems to be constant error variance.
# Constant Error Variance
require(car)
ncvTest(audioVifStep)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 0.7215417, Df = 1, p = 0.39564
Usually, this is checked by looking at the errors of the model sorted in observation order. To make sure, that the data used for fitting the model was sorted, we sort the data and refit the model. Than we perform the Durbin-Watson test, where the null-hypothesis is, that we have independence. We can also look at auto-correlation.
There clearly is some correlation. As Independence is a critical assumption, we should handle the results with caution. Some of the significant coefficents might actually be irrelevant. We will find a way to cater for the dependence with Generalised Least Squares.
# Independence
# sort:
merged <- merged[order(merged$timestamp),]
audioVifStep <- lm(miks.boden._1 ~ DateNum + TimeNum + HumidityHive + TemperatureHive +
TemperatureOutside + Barometer + HumidyOutside + Rain + WindSpeed +
WindDir, data=merged)
plot(1:length(rstandard(audioVifStep)), rstandard(audioVifStep), xlab="Observation Order")
durbinWatsonTest(audioVifStep)
## lag Autocorrelation D-W Statistic p-value
## 1 0.6861646 0.6256698 0
## Alternative hypothesis: rho != 0
acf(residuals(audioVifStep))
The errors seem to be normally distributed.
# Normality
hist(rstandard(audioVifStep), main="Model residuals")
It would have been nice to explain the variance of our data with a simple linear regression. As we added Date and Time in our model, we hoped to adress potential dependencies. However, we found that the errors are not independent. To make sure that we can trust our model, we refit it with a correlation-structure.
Let’s try a first order correlation structure.
library(nlme)
audioar1 <- gls(miks.boden._1 ~ DateNum + TimeNum + HumidityHive + TemperatureHive +
TemperatureOutside + Barometer + HumidyOutside + Rain + WindSpeed +
WindDir, data=merged, correlation=corAR1(form =~1), method="ML", na.action = na.omit)
summary(audioar1)
## Generalized least squares fit by maximum likelihood
## Model: miks.boden._1 ~ DateNum + TimeNum + HumidityHive + TemperatureHive + TemperatureOutside + Barometer + HumidyOutside + Rain + WindSpeed + WindDir
## Data: merged
## AIC BIC logLik
## 28191.03 28265.83 -14082.52
##
## Correlation Structure: AR(1)
## Formula: ~1
## Parameter estimate(s):
## Phi
## 0.6970528
##
## Coefficients:
## Value Std.Error t-value p-value
## (Intercept) -7184.926 1501.3838 -4.785536 0.0000
## DateNum -7.175 0.4686 -15.312454 0.0000
## TimeNum 0.085 0.1084 0.781113 0.4348
## HumidityHive 0.885 0.4510 1.962738 0.0498
## TemperatureHive -31.385 6.9141 -4.539281 0.0000
## TemperatureOutside 12.819 2.7600 4.644511 0.0000
## Barometer 9.399 1.4524 6.471680 0.0000
## HumidyOutside 0.853 0.8021 1.063288 0.2878
## Rain 72.232 33.5628 2.152157 0.0315
## WindSpeed -0.005 1.0486 -0.004756 0.9962
## WindDir 0.031 0.0216 1.441021 0.1497
##
## Correlation:
## (Intr) DateNm TimeNm HmdtyH TmprtH TmprtO Barmtr HmdyOt
## DateNum -0.172
## TimeNum 0.065 0.051
## HumidityHive 0.073 -0.121 -0.018
## TemperatureHive -0.215 -0.304 -0.081 0.208
## TemperatureOutside 0.162 -0.243 0.050 -0.048 -0.057
## Barometer -0.987 0.233 -0.057 -0.126 0.067 -0.210
## HumidyOutside -0.021 -0.321 -0.075 0.010 -0.011 0.826 -0.034
## Rain -0.055 -0.023 0.020 0.020 -0.021 -0.002 0.059 -0.019
## WindSpeed -0.160 -0.009 0.035 -0.051 -0.019 0.027 0.160 0.109
## WindDir -0.085 0.057 0.031 0.010 0.005 -0.072 0.086 -0.031
## Rain WndSpd
## DateNum
## TimeNum
## HumidityHive
## TemperatureHive
## TemperatureOutside
## Barometer
## HumidyOutside
## Rain
## WindSpeed 0.053
## WindDir -0.025 -0.092
##
## Standardized residuals:
## Min Q1 Med Q3 Max
## -3.99783692 -0.63130075 -0.01700845 0.53851352 3.82663588
##
## Residual standard error: 141.8792
## Degrees of freedom: 2331 total; 2320 residual
In order to check our correlation structure, we can look at the ACF plots:
AIC(audioVifStep, audioar1)
## df AIC
## audioVifStep 12 29701.45
## audioar1 13 28191.03
#acf(residuals(audioVifStep), main="Residual Indep.")
acf(residuals(audioar1, type="normalized"),
main="Residuals for AR(1) model")
For our previous model we had correlations even for measurements with a distance of 10 time-points. This is much better with the new structure.
Compared to the previous model, there is not such an obvious pattern in the residuals anymore.
plot(1:length(residuals(audioar1, type="normalized")), residuals(audioar1, type="normalized"), xlab="Observation Order")
The residuals seem to be normally distributed:
hist(residuals(audioar1, type = "normalized"), main="Model residuals", n=20)
The p-values between the models differ quite a bit:
res<- Anova(audioVifStep, type=3)
res1 <- summary(audioar1)
data.frame( "Standard Multiple Regression"= round(res$`Pr(>F)`, digits = 4)[1:10],
"Generalised Linear Squares"=round(res1$tTable[,4], digits =4)[2:11],
row.names = row.names(res)[1:10] )
## Standard.Multiple.Regression Generalised.Linear.Squares
## (Intercept) 0.0000 0.0000
## DateNum 0.0000 0.4348
## TimeNum 0.0000 0.0498
## HumidityHive 0.0000 0.0000
## TemperatureHive 0.0000 0.0000
## TemperatureOutside 0.0000 0.0000
## Barometer 0.0000 0.2878
## HumidyOutside 0.0111 0.0315
## Rain 0.0015 0.9962
## WindSpeed 0.0103 0.1497
Meanwhile, the estimates tell us a similar story. Here is the coefficient table for the simple linear regression model:
sum <- summary(audioVifStep)
kable(round(sum$coefficients, digits = 4))
Estimate | Std. Error | t value | Pr(>|t|) | |
---|---|---|---|---|
(Intercept) | -5577.0088 | 731.1698 | -7.6275 | 0.0000 |
DateNum | -7.2537 | 0.2148 | -33.7671 | 0.0000 |
TimeNum | 0.3646 | 0.0767 | 4.7551 | 0.0000 |
HumidityHive | 2.2348 | 0.4482 | 4.9863 | 0.0000 |
TemperatureHive | -34.6396 | 4.0623 | -8.5271 | 0.0000 |
TemperatureOutside | 14.6059 | 1.3258 | 11.0165 | 0.0000 |
Barometer | 7.7682 | 0.7082 | 10.9685 | 0.0000 |
HumidyOutside | 1.0395 | 0.4091 | 2.5408 | 0.0111 |
Rain | 118.8902 | 37.3316 | 3.1847 | 0.0015 |
WindSpeed | 2.5284 | 0.9846 | 2.5680 | 0.0103 |
WindDir | 0.0379 | 0.0263 | 1.4387 | 0.1504 |
And here the table for the model with correlation structure:
kable(round(res1$tTable, digits = 4))
Value | Std.Error | t-value | p-value | |
---|---|---|---|---|
(Intercept) | -7184.9265 | 1501.3838 | -4.7855 | 0.0000 |
DateNum | -7.1747 | 0.4686 | -15.3125 | 0.0000 |
TimeNum | 0.0847 | 0.1084 | 0.7811 | 0.4348 |
HumidityHive | 0.8852 | 0.4510 | 1.9627 | 0.0498 |
TemperatureHive | -31.3850 | 6.9141 | -4.5393 | 0.0000 |
TemperatureOutside | 12.8188 | 2.7600 | 4.6445 | 0.0000 |
Barometer | 9.3992 | 1.4524 | 6.4717 | 0.0000 |
HumidyOutside | 0.8528 | 0.8021 | 1.0633 | 0.2878 |
Rain | 72.2324 | 33.5628 | 2.1522 | 0.0315 |
WindSpeed | -0.0050 | 1.0486 | -0.0048 | 0.9962 |
WindDir | 0.0311 | 0.0216 | 1.4410 | 0.1497 |
Let us focus on the results from the GLS model, as it seems to be more stable. It also improved the AIC value from 29 701 to 28 191.
The correlation parameter of the model is 0.7, which means that the mean frequencies for two subsequent 10-minute periods have a correlation of 0.7, measurements 20 minutes apart are assumed to have a correlation of \(0.7^2=0.49\) and so on. This is considered in the error term of the model.
Overall, the frequency seems to decrease in the time period betwen the 13th of April 2013 and the 7th of June 2013. Roughly speaking, the frequency decreases by 7 Hz every day. This can also be seen from the frequency plotted in timeorder. Only in the second half of June the frequency increases, however we do not have weather data for this period.
plot(merged$timestamp, merged$miks.boden._1)
A one-percent increase in humidity is related to an increase of the frequency by 0.9 Hz.
An increase in temperature in the hive by 1 °C corresponds to a decrease of the frequency by 31 Hz.
An increase in temperature outside of the hive by 1 °C corresponds to an increase of the frequency by 12.8 Hz.
A 1 hPa increase in air pressure is related to a 9.4 Hz frequency increase.
1 mm of rain results in an increase of the mean frequency by 70 Hz. In fact, we rarely had rainfall higher than 0.2 mm. The people from BeABee spend some effort to build a roof over there hive. However, we cannot tell from our model, if the frequency increase occurs, because the the bees behave differently during rain, or if rainfall was recorded.