1 Introduction

BeABee was an art and science project conducted by a group in Switzerland. Its aim was to identify new patterns in bee sound data and possibly map the measurements with labels regarding the state of the bee population. The data was also used for an auditive art installation.

BeABee made their data available to us and we performed a Fast Fourier Transform (FFT) on the sound files. In this document, we present some exploratory results. Mainly, we investigate how the frequency of the sounds change over time and how it correlates with the weather data.

2 Methods

2.1 Setup

BeABee installed 6 microphones:

  • within the hive, positioned at the center of the floor
  • outside of the hive
  • on the right side of the entree canal
  • on the left side of the entree canal
  • at the bottom-left side, within the hive
  • at the bottom-right side, within the hive

Additionally, they used piezo contact microphones, which have not been considered in our analysis yet.

2.2 Analysis

The FFT was performed on 10-minute bins. For the calculation of the mean, median and dominant frequency we did not employ any filter and the high frequency resolution is remained. We saved the mean of 20 Hz frequency bins in a range between 0 and 1000 Hz. The code can be found in fft/.

For a first analysis, we used some exploratory methods, a multiple linear regression model and a linear regression with a correlation structure (Generalised Least Squares). So far, we focus on only one microphone, the microphone on the left side in the bottom of the hive.

2.3 Data

The weather data can be accessed under http://beabee.ch/forschungsdaten.html and in this repository under meteoNew.csv.

The original audio data has a size of approximately 2.8 TB and is therefore not published.

The output of the FFT comes in aggregated form. It contains the mean for 20 Hz frequency bins (columns X0 - X980), the mean frequency (mean), its standard deviation (sd) and standard error (sem), the median frequency (median), the dominant frequency (mode), the channel, a timestamp and the filename.

Here are the first 3 entries for the highest frequency bins and the further columns:

df <- read.table("aggregatedBeABeeAudio.csv", header = T, sep=",")
library(knitr)
kable(head(df[,49:58], n = 3))
X960 X980 mean sd sem median mode channel timestamp file
0.0033774 0.0023070 3710.801 5224.228 1.436290 1246.955 50.03667 2 2013-05-01 00:30:00 mik innen-aussen-01-may-00-30.wav
0.0035939 0.0035109 3244.735 4944.609 1.359415 1112.138 50.02500 2 2013-05-01 00:40:00 mik innen-aussen-01-may-00-30.wav
0.0052965 0.0072284 3535.596 5200.648 1.429807 1135.558 50.01834 2 2013-05-01 00:50:00 mik innen-aussen-01-may-00-30.wav

We create a script to disaggregate the data and to merge it with the weather data. It is hidden behind the “code” button.

## This is a script to read the aggregated beabee-audio data and to disaggregate it

#### read aggregated data
df <- read.table("aggregatedBeABeeAudio.csv", header = T, sep=",")
# create a type column, identifying the mice
df$type <- substr(df$file,1,11)
unique(df$type)
##### disaggregate. You can choose an other column than the mean frequency
library(reshape)
disagg <- cast(df, timestamp~type+channel, value="mean")
nrow(disagg)
names(disagg)
#### combine with weather data
weather <- read.table("meteoNew.csv", header = T, sep=",")
#### get the timestamp in the correct format
# for disagg, at midnight "00:00:00" is missing
disagg$timestamp <- as.character(disagg$timestamp)
disagg$timestamp <- ifelse(nchar(disagg$timestamp) <= 10,yes = paste(disagg$timestamp, "00:00:00"), no = disagg$timestamp)
# for weather, it sometimes has the form 7:00:00 instead of 07:00:00 (as example)
weather$timestamp <- paste(weather$Date, weather$Time)
weather$timestamp <- as.character(as.POSIXlt(weather$timestamp, format="%m/%d/%Y %H:%M:%S"))
#### merge data
merged <-  merge(disagg, weather, by="timestamp", all = T)
# controll which timestamps are missing
head(disagg[!(disagg$timestamp %in% weather$timestamp),], n = 50)
missingweather <- merged[is.na(merged$Rain),]
nrow(missingweather)
nrow(merged)
# save the data 
write.csv(merged, file= "mergedmean.csv", sep=",", quote = F, row.names = F)

The most important line of the script is

disagg <- cast(df, timestamp~type+channel, value="mean")

Here the data gets disaggregated, with “mean” being the only measurement retained for all microphones.

The script results in data in the following format:

require(lattice)
require(ggplot2)
source("cor.R")
merged <-  read.csv("mergedmean.csv")
library(knitr)
kable(head(merged[15:18,], n = 3))
timestamp mik.innen.a_1 mik.innen.a_2 miks.boden._1 miks.boden._2 miks.kanal._1 miks.kanal._2 Date Time HumidityHive TemperatureHive WatervapourdenHive TemperatureOutside Barometer HumidyOutside Rain WindSpeed WindDir
15 2013-04-13 11:00:00 NA NA 1327.220 1394.588 NA NA 4/13/2013 11:00:00 100.00 10.14 9.471 9.05556 1022.35 76 0 3.22 202
16 2013-04-13 11:10:00 NA NA 1929.673 2068.534 NA NA 4/13/2013 11:10:00 79.99 22.38 15.858 9.27778 1022.46 75 0 1.61 180
17 2013-04-13 11:20:00 NA NA 1383.662 1537.951 NA NA 4/13/2013 11:20:00 59.75 24.23 13.163 10.05560 1022.52 75 0 0.00 225

where mik.innen.a_1 is within the hive, positioned at the center of the floor, mik.innen.a_2 is outside of the hive miks.boden._1 is at the bottom-left side, within the hive miks.boden._2 is at the bottom-right side, within the hive miks.kanal._1 is on the right side of the entree canal miks.kanal._2 is on the left side of the entree canal.

3 Results

3.1 Exploratory Analysis

3.1.1 Outlier

We decided to remove temperature measurements under 27.5 °C, as we suspect they only occured when the sensor was just installed.

# remove value
#nonn <- merged[!is.na(merged$mik.innen.a_1),]
#merged <- merged[-3789,] 
# nothing can be heard in the audio, let's remove it
#hist(merged$TemperatureHive)
merged <- merged[-which(merged$TemperatureHive<27.5),]

3.1.2 Relationships

Let us just look at some of the relationships between measurements. We find many statistically significant relationships (significance level indicated by stars, 3 stars is the best).

pairs(merged[, 10:12], lower.panel=panel.smooth, upper.panel=panel.cor)

pairs(merged[, c(2, 4, 13, 15)], lower.panel=panel.smooth, upper.panel=panel.cor)

Especcialy, let us look at the relationships between our dependent and the independent variables. We converted Time and Date into numerical variables by matching each date/ time with an index. For some covariates is hard to say, if a linear relationship is justified, for example wind speed, which is not measured continously either. For others, like temperature outside, there is quite an obvious relation. Looking at rain, it might be a good idea to transform this to a factor variable with a “rain” and a “no rain” option.

coefs <- 10:20
ab <- 4
merged$DateNum <- as.numeric(merged$Date)
merged$TimeNum <- as.numeric(merged$Time)
for(i in 1:length(coefs)){
  plot(merged[,coefs[i]], merged[,ab], xlab=names(merged)[coefs[i]],
       ylab=names(merged)[ab])
}

3.2 Multiple Linear Regression

For some of the covariates it is hard to say, if there is a linear relationship. A Multiple Linear Regression therefore might no be the best choice and results are to be handled if caution, but it can give us as a good first impression.

3.2.1 Model selection

We start with a model considering all covariates.

audio <- lm(miks.boden._1 ~ DateNum+ TimeNum+ HumidityHive+ TemperatureHive+ WatervapourdenHive+ TemperatureOutside+  Barometer+ HumidyOutside+ Rain+ WindSpeed+ WindDir , data=merged)
summary(audio)
## 
## Call:
## lm(formula = miks.boden._1 ~ DateNum + TimeNum + HumidityHive + 
##     TemperatureHive + WatervapourdenHive + TemperatureOutside + 
##     Barometer + HumidyOutside + Rain + WindSpeed + WindDir, data = merged)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -565.91  -88.77   -4.48   73.64  550.47 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -5.496e+03  1.305e+03  -4.213 2.61e-05 ***
## DateNum            -7.256e+00  2.176e-01 -33.346  < 2e-16 ***
## TimeNum             3.645e-01  7.673e-02   4.750 2.16e-06 ***
## HumidityHive        1.576e+00  8.841e+00   0.178   0.8586    
## TemperatureHive    -3.707e+01  3.280e+01  -1.130   0.2585    
## WatervapourdenHive  1.902e+00  2.548e+01   0.075   0.9405    
## TemperatureOutside  1.461e+01  1.326e+00  11.014  < 2e-16 ***
## Barometer           7.766e+00  7.088e-01  10.958  < 2e-16 ***
## HumidyOutside       1.041e+00  4.099e-01   2.540   0.0111 *  
## Rain                1.191e+02  3.748e+01   3.178   0.0015 ** 
## WindSpeed           2.526e+00  9.856e-01   2.562   0.0105 *  
## WindDir             3.793e-02  2.634e-02   1.440   0.1500    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 141.1 on 2319 degrees of freedom
##   (7590 observations deleted due to missingness)
## Multiple R-squared:  0.6086, Adjusted R-squared:  0.6067 
## F-statistic: 327.8 on 11 and 2319 DF,  p-value: < 2.2e-16

Before we use this modell, we should check for collinearity between covariates. Collinearity makes the model unstable and the standard errors for the estimates might be very high. We can detect collinearity using variance inflation factors (VIFs):

library(car)
## Loading required package: carData
vif(audio)
##            DateNum            TimeNum       HumidityHive 
##           1.652567           1.196500         517.771899 
##    TemperatureHive WatervapourdenHive TemperatureOutside 
##          89.249233         515.791478           4.478465 
##          Barometer      HumidyOutside               Rain 
##           1.609649           4.865273           1.116961 
##          WindSpeed            WindDir 
##           1.216154           1.124947

There is no clear rule which tells us, if a VIF is good or to high. Usually, a VIF should not be higher than 5 or 10. We therefore decided to remove WatervapourdenHive from our model:

audioVif <- lm(miks.boden._1 ~ DateNum+ TimeNum+ HumidityHive+ TemperatureHive+  TemperatureOutside+  Barometer+ HumidyOutside+ Rain+ WindSpeed+ WindDir, data=merged)
summary(audioVif)
## 
## Call:
## lm(formula = miks.boden._1 ~ DateNum + TimeNum + HumidityHive + 
##     TemperatureHive + TemperatureOutside + Barometer + HumidyOutside + 
##     Rain + WindSpeed + WindDir, data = merged)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -565.97  -88.76   -4.52   73.43  550.19 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -5.577e+03  7.312e+02  -7.628 3.47e-14 ***
## DateNum            -7.254e+00  2.148e-01 -33.767  < 2e-16 ***
## TimeNum             3.646e-01  7.668e-02   4.755 2.11e-06 ***
## HumidityHive        2.235e+00  4.482e-01   4.986 6.61e-07 ***
## TemperatureHive    -3.464e+01  4.062e+00  -8.527  < 2e-16 ***
## TemperatureOutside  1.461e+01  1.326e+00  11.016  < 2e-16 ***
## Barometer           7.768e+00  7.082e-01  10.968  < 2e-16 ***
## HumidyOutside       1.040e+00  4.091e-01   2.541  0.01113 *  
## Rain                1.189e+02  3.733e+01   3.185  0.00147 ** 
## WindSpeed           2.528e+00  9.846e-01   2.568  0.01029 *  
## WindDir             3.786e-02  2.631e-02   1.439  0.15036    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 141.1 on 2320 degrees of freedom
##   (7590 observations deleted due to missingness)
## Multiple R-squared:  0.6086, Adjusted R-squared:  0.6069 
## F-statistic: 360.7 on 10 and 2320 DF,  p-value: < 2.2e-16

We can look at the VIFs again:

vif(audioVif)
##            DateNum            TimeNum       HumidityHive 
##           1.611198           1.195549           1.331180 
##    TemperatureHive TemperatureOutside          Barometer 
##           1.369814           4.478452           1.607939 
##      HumidyOutside               Rain          WindSpeed 
##           4.849861           1.108363           1.214228 
##            WindDir 
##           1.123512

Now, the VIF values are looking good.

Looking at AIC values, it does not matter if we keep WindDirection in the model. All other covariates seem to be significant! This can be seen from the summary output above.

step.linearAllAIC<- step(audioVif, direction="both")
## Start:  AIC=23084.36
## miks.boden._1 ~ DateNum + TimeNum + HumidityHive + TemperatureHive + 
##     TemperatureOutside + Barometer + HumidyOutside + Rain + WindSpeed + 
##     WindDir
## 
##                      Df Sum of Sq      RSS   AIC
## <none>                            46168844 23084
## - WindDir             1     41193 46210037 23084
## - HumidyOutside       1    128468 46297313 23089
## - WindSpeed           1    131237 46300082 23089
## - Rain                1    201836 46370681 23092
## - TimeNum             1    449961 46618805 23105
## - HumidityHive        1    494790 46663634 23107
## - TemperatureHive     1   1446988 47615832 23154
## - Barometer           1   2394160 48563005 23200
## - TemperatureOutside  1   2415164 48584008 23201
## - DateNum             1  22690696 68859541 24014
audioVifStep <- lm(miks.boden._1 ~ DateNum + TimeNum + HumidityHive + TemperatureHive +
    TemperatureOutside + Barometer + HumidyOutside + Rain + WindSpeed +
    WindDir, data=merged)
summary(audioVifStep)
## 
## Call:
## lm(formula = miks.boden._1 ~ DateNum + TimeNum + HumidityHive + 
##     TemperatureHive + TemperatureOutside + Barometer + HumidyOutside + 
##     Rain + WindSpeed + WindDir, data = merged)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -565.97  -88.76   -4.52   73.43  550.19 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -5.577e+03  7.312e+02  -7.628 3.47e-14 ***
## DateNum            -7.254e+00  2.148e-01 -33.767  < 2e-16 ***
## TimeNum             3.646e-01  7.668e-02   4.755 2.11e-06 ***
## HumidityHive        2.235e+00  4.482e-01   4.986 6.61e-07 ***
## TemperatureHive    -3.464e+01  4.062e+00  -8.527  < 2e-16 ***
## TemperatureOutside  1.461e+01  1.326e+00  11.016  < 2e-16 ***
## Barometer           7.768e+00  7.082e-01  10.968  < 2e-16 ***
## HumidyOutside       1.040e+00  4.091e-01   2.541  0.01113 *  
## Rain                1.189e+02  3.733e+01   3.185  0.00147 ** 
## WindSpeed           2.528e+00  9.846e-01   2.568  0.01029 *  
## WindDir             3.786e-02  2.631e-02   1.439  0.15036    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 141.1 on 2320 degrees of freedom
##   (7590 observations deleted due to missingness)
## Multiple R-squared:  0.6086, Adjusted R-squared:  0.6069 
## F-statistic: 360.7 on 10 and 2320 DF,  p-value: < 2.2e-16

It would be reasonable to have some interaction between Date and Time, as the timepoint of the sunrise changes every day. However, as long as Date is numeric, we do not find a significant relationship:

# interaction not significant:
inter <- lm(miks.boden._1 ~ DateNum + TimeNum + HumidityHive + TemperatureHive +
    TemperatureOutside + Barometer + HumidyOutside + Rain + WindSpeed +
    WindDir+ TimeNum:DateNum, data=merged)
summary(inter)
## 
## Call:
## lm(formula = miks.boden._1 ~ DateNum + TimeNum + HumidityHive + 
##     TemperatureHive + TemperatureOutside + Barometer + HumidyOutside + 
##     Rain + WindSpeed + WindDir + TimeNum:DateNum, data = merged)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -560.62  -90.10   -3.93   74.29  550.68 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -5.536e+03  7.314e+02  -7.569 5.38e-14 ***
## DateNum            -6.778e+00  3.714e-01 -18.247  < 2e-16 ***
## TimeNum             5.874e-01  1.612e-01   3.644 0.000275 ***
## HumidityHive        2.222e+00  4.481e-01   4.959 7.59e-07 ***
## TemperatureHive    -3.520e+01  4.077e+00  -8.635  < 2e-16 ***
## TemperatureOutside  1.473e+01  1.328e+00  11.094  < 2e-16 ***
## Barometer           7.729e+00  7.084e-01  10.910  < 2e-16 ***
## HumidyOutside       1.042e+00  4.090e-01   2.548 0.010901 *  
## Rain                1.177e+02  3.733e+01   3.152 0.001642 ** 
## WindSpeed           2.470e+00  9.850e-01   2.507 0.012238 *  
## WindDir             3.970e-02  2.633e-02   1.508 0.131732    
## DateNum:TimeNum    -6.600e-03  4.202e-03  -1.571 0.116356    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 141 on 2319 degrees of freedom
##   (7590 observations deleted due to missingness)
## Multiple R-squared:  0.609,  Adjusted R-squared:  0.6071 
## F-statistic: 328.4 on 11 and 2319 DF,  p-value: < 2.2e-16

Using Date as factor results in a significant relationship. Nevertheless, this is not very helpful for generalisations. There seem to be other factors then the duration of sunlight, which are not part of our data.

# but with factor highly significant:
interfac <- lm(miks.boden._1 ~ DateNum + TimeNum + HumidityHive + TemperatureHive +
    TemperatureOutside + Barometer + HumidyOutside + Rain + WindSpeed +
    WindDir+ TimeNum:Date, data=merged)
 summary(interfac)
## 
## Call:
## lm(formula = miks.boden._1 ~ DateNum + TimeNum + HumidityHive + 
##     TemperatureHive + TemperatureOutside + Barometer + HumidyOutside + 
##     Rain + WindSpeed + WindDir + TimeNum:Date, data = merged)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -520.80  -76.52   -0.05   72.94  554.47 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           -1.622e+03  1.087e+03  -1.493 0.135698    
## DateNum               -7.041e+00  3.473e-01 -20.272  < 2e-16 ***
## TimeNum                1.844e+00  3.645e-01   5.060 4.52e-07 ***
## HumidityHive           1.464e+00  4.843e-01   3.022 0.002537 ** 
## TemperatureHive       -4.668e+01  4.346e+00 -10.741  < 2e-16 ***
## TemperatureOutside     1.329e+01  1.816e+00   7.318 3.45e-13 ***
## Barometer              4.209e+00  1.052e+00   4.002 6.49e-05 ***
## HumidyOutside          2.463e+00  5.237e-01   4.704 2.71e-06 ***
## Rain                   1.567e+01  3.537e+01   0.443 0.657766    
## WindSpeed              3.672e+00  9.536e-01   3.850 0.000121 ***
## WindDir                1.054e-01  2.505e-02   4.207 2.69e-05 ***
## TimeNum:Date4/14/2013 -7.047e-01  3.531e-01  -1.996 0.046048 *  
## TimeNum:Date4/15/2013 -2.157e-01  3.685e-01  -0.585 0.558271    
## TimeNum:Date4/16/2013 -5.487e-01  3.946e-01  -1.390 0.164545    
## TimeNum:Date4/17/2013 -3.059e-01  4.097e-01  -0.747 0.455353    
## TimeNum:Date4/18/2013 -6.684e-02  4.176e-01  -0.160 0.872852    
## TimeNum:Date4/27/2013 -3.074e+00  4.789e-01  -6.418 1.67e-10 ***
## TimeNum:Date4/28/2013 -2.324e+00  3.673e-01  -6.329 2.96e-10 ***
## TimeNum:Date4/29/2013 -2.652e+00  3.720e-01  -7.130 1.34e-12 ***
## TimeNum:Date4/30/2013 -2.689e+00  3.769e-01  -7.134 1.30e-12 ***
## TimeNum:Date5/1/2013  -2.636e+00  3.885e-01  -6.785 1.47e-11 ***
## TimeNum:Date5/17/2013 -3.592e+00  6.807e-01  -5.277 1.44e-07 ***
## TimeNum:Date5/18/2013 -1.814e+00  5.276e-01  -3.439 0.000595 ***
## TimeNum:Date5/22/2013 -1.842e+00  5.665e-01  -3.251 0.001168 ** 
## TimeNum:Date5/23/2013 -2.061e-01  4.249e-01  -0.485 0.627738    
## TimeNum:Date5/24/2013 -2.719e+00  4.187e-01  -6.495 1.02e-10 ***
## TimeNum:Date5/25/2013 -2.287e+00  4.370e-01  -5.234 1.81e-07 ***
## TimeNum:Date5/26/2013 -3.217e+00  4.326e-01  -7.437 1.45e-13 ***
## TimeNum:Date5/27/2013 -1.851e+00  4.540e-01  -4.076 4.74e-05 ***
## TimeNum:Date5/28/2013 -3.692e+00  7.901e-01  -4.673 3.13e-06 ***
## TimeNum:Date5/29/2013 -2.207e+00  4.377e-01  -5.043 4.94e-07 ***
## TimeNum:Date5/30/2013 -2.976e+00  4.184e-01  -7.112 1.52e-12 ***
## TimeNum:Date5/31/2013 -9.381e-01  4.160e-01  -2.255 0.024232 *  
## TimeNum:Date5/3/2013  -2.602e+00  4.917e-01  -5.291 1.33e-07 ***
## TimeNum:Date5/4/2013  -1.093e+00  3.894e-01  -2.807 0.005045 ** 
## TimeNum:Date5/5/2013  -8.409e-01  3.937e-01  -2.136 0.032805 *  
## TimeNum:Date5/6/2013  -1.108e+00  4.032e-01  -2.748 0.006046 ** 
## TimeNum:Date5/7/2013  -1.252e+00  4.111e-01  -3.045 0.002356 ** 
## TimeNum:Date5/8/2013  -9.880e-01  4.015e-01  -2.461 0.013932 *  
## TimeNum:Date5/9/2013  -1.137e+00  4.302e-01  -2.642 0.008291 ** 
## TimeNum:Date6/1/2013  -2.617e+00  4.446e-01  -5.885 4.56e-09 ***
## TimeNum:Date6/2/2013  -2.919e+00  4.393e-01  -6.645 3.79e-11 ***
## TimeNum:Date6/3/2013  -1.539e+00  4.263e-01  -3.612 0.000311 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 124.6 on 2288 degrees of freedom
##   (7590 observations deleted due to missingness)
## Multiple R-squared:  0.6988, Adjusted R-squared:  0.6933 
## F-statistic: 126.4 on 42 and 2288 DF,  p-value: < 2.2e-16

3.2.2 Model assesment

When performing a linear regression, we indirectly make some assumptions about the modelled data. * We assume that each relationship between one of the covariates and the dependent variable is linear. * We assume that the variance of the model errors is constant. * We assume that errors are independent. * We assume that the errors are normally distributed.

3.2.2.1 Linearity

This assumption seems to be fulfilled. It can be accessed via partial residual plots. The relationship should be indicated by a straight line.

# Linearity
#termplot(audioVifStep,se=T)
termplot(audioVifStep,se=T,partial.resid=T)

3.2.2.2 Constant Error Variance

This can be checked with the Breusch-Pagan test. The null hypothesis is, that the error variance is constant, while the alternative hypothesis is, that the variance varies with the fitted values/ a linear combination of the predictors. If the p-value is higher than 0.05, we do not reject the null hypothesis. So, there seems to be constant error variance.

# Constant Error Variance
require(car)
ncvTest(audioVifStep)
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 0.7215417, Df = 1, p = 0.39564

3.2.2.3 Independence

Usually, this is checked by looking at the errors of the model sorted in observation order. To make sure, that the data used for fitting the model was sorted, we sort the data and refit the model. Than we perform the Durbin-Watson test, where the null-hypothesis is, that we have independence. We can also look at auto-correlation.

There clearly is some correlation. As Independence is a critical assumption, we should handle the results with caution. Some of the significant coefficents might actually be irrelevant. We will find a way to cater for the dependence with Generalised Least Squares.

# Independence
# sort:
merged <- merged[order(merged$timestamp),]
audioVifStep <- lm(miks.boden._1 ~ DateNum + TimeNum + HumidityHive + TemperatureHive +
    TemperatureOutside + Barometer + HumidyOutside + Rain + WindSpeed +
    WindDir, data=merged)
plot(1:length(rstandard(audioVifStep)), rstandard(audioVifStep), xlab="Observation Order")

durbinWatsonTest(audioVifStep)
##  lag Autocorrelation D-W Statistic p-value
##    1       0.6861646     0.6256698       0
##  Alternative hypothesis: rho != 0
acf(residuals(audioVifStep))

3.2.2.4 Normality

The errors seem to be normally distributed.

# Normality
hist(rstandard(audioVifStep), main="Model residuals")

3.3 Regression using Generalised Least Squares

It would have been nice to explain the variance of our data with a simple linear regression. As we added Date and Time in our model, we hoped to adress potential dependencies. However, we found that the errors are not independent. To make sure that we can trust our model, we refit it with a correlation-structure.

3.3.1 Model fitting

Let’s try a first order correlation structure.

library(nlme)
audioar1 <- gls(miks.boden._1 ~ DateNum + TimeNum + HumidityHive + TemperatureHive +
    TemperatureOutside + Barometer + HumidyOutside + Rain + WindSpeed +
    WindDir, data=merged, correlation=corAR1(form =~1), method="ML", na.action = na.omit)
summary(audioar1)
## Generalized least squares fit by maximum likelihood
##   Model: miks.boden._1 ~ DateNum + TimeNum + HumidityHive + TemperatureHive +      TemperatureOutside + Barometer + HumidyOutside + Rain + WindSpeed +      WindDir 
##   Data: merged 
##        AIC      BIC    logLik
##   28191.03 28265.83 -14082.52
## 
## Correlation Structure: AR(1)
##  Formula: ~1 
##  Parameter estimate(s):
##       Phi 
## 0.6970528 
## 
## Coefficients:
##                        Value Std.Error    t-value p-value
## (Intercept)        -7184.926 1501.3838  -4.785536  0.0000
## DateNum               -7.175    0.4686 -15.312454  0.0000
## TimeNum                0.085    0.1084   0.781113  0.4348
## HumidityHive           0.885    0.4510   1.962738  0.0498
## TemperatureHive      -31.385    6.9141  -4.539281  0.0000
## TemperatureOutside    12.819    2.7600   4.644511  0.0000
## Barometer              9.399    1.4524   6.471680  0.0000
## HumidyOutside          0.853    0.8021   1.063288  0.2878
## Rain                  72.232   33.5628   2.152157  0.0315
## WindSpeed             -0.005    1.0486  -0.004756  0.9962
## WindDir                0.031    0.0216   1.441021  0.1497
## 
##  Correlation: 
##                    (Intr) DateNm TimeNm HmdtyH TmprtH TmprtO Barmtr HmdyOt
## DateNum            -0.172                                                 
## TimeNum             0.065  0.051                                          
## HumidityHive        0.073 -0.121 -0.018                                   
## TemperatureHive    -0.215 -0.304 -0.081  0.208                            
## TemperatureOutside  0.162 -0.243  0.050 -0.048 -0.057                     
## Barometer          -0.987  0.233 -0.057 -0.126  0.067 -0.210              
## HumidyOutside      -0.021 -0.321 -0.075  0.010 -0.011  0.826 -0.034       
## Rain               -0.055 -0.023  0.020  0.020 -0.021 -0.002  0.059 -0.019
## WindSpeed          -0.160 -0.009  0.035 -0.051 -0.019  0.027  0.160  0.109
## WindDir            -0.085  0.057  0.031  0.010  0.005 -0.072  0.086 -0.031
##                    Rain   WndSpd
## DateNum                         
## TimeNum                         
## HumidityHive                    
## TemperatureHive                 
## TemperatureOutside              
## Barometer                       
## HumidyOutside                   
## Rain                            
## WindSpeed           0.053       
## WindDir            -0.025 -0.092
## 
## Standardized residuals:
##         Min          Q1         Med          Q3         Max 
## -3.99783692 -0.63130075 -0.01700845  0.53851352  3.82663588 
## 
## Residual standard error: 141.8792 
## Degrees of freedom: 2331 total; 2320 residual

In order to check our correlation structure, we can look at the ACF plots:

AIC(audioVifStep, audioar1)
##              df      AIC
## audioVifStep 12 29701.45
## audioar1     13 28191.03
#acf(residuals(audioVifStep), main="Residual Indep.")
acf(residuals(audioar1, type="normalized"),
main="Residuals for AR(1) model")

For our previous model we had correlations even for measurements with a distance of 10 time-points. This is much better with the new structure.

3.3.2 Model Assesment

3.3.2.1 Independence

Compared to the previous model, there is not such an obvious pattern in the residuals anymore.

plot(1:length(residuals(audioar1, type="normalized")), residuals(audioar1, type="normalized"), xlab="Observation Order")

3.3.2.2 Normallity

The residuals seem to be normally distributed:

hist(residuals(audioar1, type = "normalized"), main="Model residuals", n=20)

3.4 Summary of Results

The p-values between the models differ quite a bit:

res<- Anova(audioVifStep, type=3)
res1 <- summary(audioar1)

data.frame( "Standard Multiple Regression"= round(res$`Pr(>F)`, digits = 4)[1:10], 
            "Generalised Linear Squares"=round(res1$tTable[,4], digits =4)[2:11],
            row.names = row.names(res)[1:10] )
##                    Standard.Multiple.Regression Generalised.Linear.Squares
## (Intercept)                              0.0000                     0.0000
## DateNum                                  0.0000                     0.4348
## TimeNum                                  0.0000                     0.0498
## HumidityHive                             0.0000                     0.0000
## TemperatureHive                          0.0000                     0.0000
## TemperatureOutside                       0.0000                     0.0000
## Barometer                                0.0000                     0.2878
## HumidyOutside                            0.0111                     0.0315
## Rain                                     0.0015                     0.9962
## WindSpeed                                0.0103                     0.1497

Meanwhile, the estimates tell us a similar story. Here is the coefficient table for the simple linear regression model:

sum <- summary(audioVifStep)
kable(round(sum$coefficients, digits = 4))
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5577.0088 731.1698 -7.6275 0.0000
DateNum -7.2537 0.2148 -33.7671 0.0000
TimeNum 0.3646 0.0767 4.7551 0.0000
HumidityHive 2.2348 0.4482 4.9863 0.0000
TemperatureHive -34.6396 4.0623 -8.5271 0.0000
TemperatureOutside 14.6059 1.3258 11.0165 0.0000
Barometer 7.7682 0.7082 10.9685 0.0000
HumidyOutside 1.0395 0.4091 2.5408 0.0111
Rain 118.8902 37.3316 3.1847 0.0015
WindSpeed 2.5284 0.9846 2.5680 0.0103
WindDir 0.0379 0.0263 1.4387 0.1504

And here the table for the model with correlation structure:

kable(round(res1$tTable, digits = 4))
Value Std.Error t-value p-value
(Intercept) -7184.9265 1501.3838 -4.7855 0.0000
DateNum -7.1747 0.4686 -15.3125 0.0000
TimeNum 0.0847 0.1084 0.7811 0.4348
HumidityHive 0.8852 0.4510 1.9627 0.0498
TemperatureHive -31.3850 6.9141 -4.5393 0.0000
TemperatureOutside 12.8188 2.7600 4.6445 0.0000
Barometer 9.3992 1.4524 6.4717 0.0000
HumidyOutside 0.8528 0.8021 1.0633 0.2878
Rain 72.2324 33.5628 2.1522 0.0315
WindSpeed -0.0050 1.0486 -0.0048 0.9962
WindDir 0.0311 0.0216 1.4410 0.1497

4 Discussion

Let us focus on the results from the GLS model, as it seems to be more stable. It also improved the AIC value from 29 701 to 28 191.

The correlation parameter of the model is 0.7, which means that the mean frequencies for two subsequent 10-minute periods have a correlation of 0.7, measurements 20 minutes apart are assumed to have a correlation of \(0.7^2=0.49\) and so on. This is considered in the error term of the model.

4.1 Date

Overall, the frequency seems to decrease in the time period betwen the 13th of April 2013 and the 7th of June 2013. Roughly speaking, the frequency decreases by 7 Hz every day. This can also be seen from the frequency plotted in timeorder. Only in the second half of June the frequency increases, however we do not have weather data for this period.

  plot(merged$timestamp, merged$miks.boden._1)
Mean Frequency of mice within the hive

Mean Frequency of mice within the hive

4.2 Humidity in the Hive

A one-percent increase in humidity is related to an increase of the frequency by 0.9 Hz.

4.3 Temperature in the Hive

An increase in temperature in the hive by 1 °C corresponds to a decrease of the frequency by 31 Hz.

4.4 Temperature Outside

An increase in temperature outside of the hive by 1 °C corresponds to an increase of the frequency by 12.8 Hz.

4.5 Barometer

A 1 hPa increase in air pressure is related to a 9.4 Hz frequency increase.

4.6 Rain

1 mm of rain results in an increase of the mean frequency by 70 Hz. In fact, we rarely had rainfall higher than 0.2 mm. The people from BeABee spend some effort to build a roof over there hive. However, we cannot tell from our model, if the frequency increase occurs, because the the bees behave differently during rain, or if rainfall was recorded.