

Motivations and options for clustering

Currently, the QPipeline considers the significance of triggers independently. The detectability of a particular GWB signal therefore depends upon its maximum single projection onto the space of Gaussian enveloped sinusoids. As a result, the QPipeline is most sensitive to signals with near minimum time-frequency uncertainty, and less sensitive to signals that are extended in time and/or frequency such that their energy is spread across multiple non-overlapping basis functions. For example, the detectability of the simulated inspiral signal shown in Figure 1 is currently determined by the single most significant tile near the center of the signal, which has a single tile SNR of 12.7. This is significantly less than the SNR of 48.2 that is recovered by a matched filter search tuned for this waveform.

Although the above example uses a binary neutron star inspiral waveform, matched filter search methods are more appropriate for such signals, since their waveform is well understood [15,16,17,18]. We focus on it here only because it represents an astrophysically interesting example of a signal that is extended in time and frequency. The QPipeline is not intended to search for inspiral signals; instead it targets other transient sources, such as the less well understood merger phase of coalescing binary compact objects, core collapse supernovae, and instabilities in spinning neutron stars, many of which are also expected to produce waveforms that are extended in time and/or frequency [19,20,21,22]. We therefore seek a method that improves the sensitivity of the QPipeline to signals that are extended in time and/or frequency, yet remains applicable to the general case of astrophysically unmodeled bursts and is not tuned to any one particular waveform.

An obvious solution is to simultaneously consider the aggregate significance of all tiles that comprise the signal. This requires an approach that identifies clusters of related tiles in the time-frequency plane. In this context, we define clustering as the grouping of the set of all significant QPipeline tiles into subsets, such that all tiles within a subset are closely related by their relative distance in the time-frequency plane.

There are many different clustering methods available; however, they all tend to fall into three categories [11].

Figure: Two different clustering methods applied to the QPipeline triggers from Figure 1. The left panel shows the result of applying a hierarchical clustering method [11], while the right panel shows the result of applying the proposed density based clustering method [23]. Each combination of color and shape denotes a different cluster. Although both approaches succeed in isolating most of the signal energy within a single cluster, the density based approach has the additional advantage of discarding isolated triggers caused by background detector noise.

\includegraphics[angle=0,width=75mm]{figures/003_hiclustered} \includegraphics[angle=0,width=75mm]{figures/004_dbclustered}

Partitioning methods The classical example of a partitioning method is the K-means algorithm [24]. In K-means clustering, a fixed number of clusters is presupposed, and an initial guess is made to partition the objects into these clusters. A centroid is computed for each cluster, and the total summed distance of all objects from their cluster centroids is taken as a figure of merit. K-means iteratively reassigns objects to different clusters until this figure of merit is minimized.

There are a number of drawbacks to the K-means approach, as described in [24,11]. First, it presupposes a fixed number of clusters, although there are variations that allow the number of clusters to change [25]. Second, it tends to produce spherical or ellipsoidal clusters rather than more complicated arbitrary shapes. This is acceptable for some signal morphologies, but not in general. For example, the signal expected from inspiralling binary compact objects, which is long and extended in time and frequency, would not be easily identified by K-means clustering [25]. Third, K-means is sensitive to the initial guess, and may converge to a local rather than global minimum of the figure of merit [11].
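To make the K-means iteration described above concrete, the following is a minimal sketch in Python. The two well-separated blobs of test points, the iteration cap, and the random initialization are illustrative assumptions for this sketch, not QPipeline data or parameters:

```python
import numpy as np

def kmeans(points, k, n_iter=50, seed=0):
    """Minimal K-means: fixed cluster count k, iterative centroid updates."""
    rng = np.random.default_rng(seed)
    # Initial guess: k distinct data points chosen at random as centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest centroid (Euclidean distance).
        d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Recompute centroids; an empty cluster keeps its previous centroid.
        new = np.array([points[labels == j].mean(axis=0)
                        if np.any(labels == j) else centroids[j]
                        for j in range(k)])
        if np.allclose(new, centroids):
            break  # figure of merit can no longer improve
        centroids = new
    return labels, centroids

# Two well-separated blobs standing in for two groups of triggers.
pts = np.vstack([np.random.default_rng(1).normal(0.0, 0.1, (20, 2)),
                 np.random.default_rng(2).normal(5.0, 0.1, (20, 2))])
labels, cents = kmeans(pts, k=2)
```

Note that `k` must be fixed in advance, which is precisely the first drawback noted above.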

Hierarchical methods This type of clustering algorithm first evaluates the pairwise distance between all $N$ objects, then arranges them into a tree structure in which each object to be clustered is a leaf [26]. In the agglomerative hierarchical approach, the tree is constructed in $N$ levels: at each level, the closest pair of leaves and/or branches is merged, with the $N$th level representing a single cluster of all objects. Clusters are then formed by cutting this tree according to some criterion, such as a maximum distance, or the inconsistency between the distance joining the leaves or branches of a cluster and the distance to the next closest leaf or branch.

Hierarchical clustering has the flexibility of producing arbitrary numbers of clusters with arbitrary shape, and is therefore more applicable to the problem of clustering in time, frequency, and Q for gravitational-wave burst detection.

The left panel of Figure 2 shows an example of hierarchical clustering, implemented by applying the functions ``linkage'' and ``cluster'' from the MATLAB Statistics Toolbox [*] to the detection of a simulated binary neutron star inspiral signal. The drawback of this approach is that it presupposes that every data point must be included in one cluster or another, even if a cluster consists of only a single data point. As a result, it produces a large number of insignificant clusters and tends to build clusters out of unrelated data points.
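A comparable sketch in Python uses SciPy's `linkage` and `fcluster`, which play roles analogous to the MATLAB functions named above. The chirp-like track and the two isolated "noise" tiles below are made-up coordinates, and the distance cutoff of 0.5 is an arbitrary illustrative choice:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy stand-in for significant time-frequency tiles: a chirp-like track
# plus two isolated noise tiles (illustrative coordinates, not real data).
track = np.column_stack([np.linspace(0, 1, 15), np.linspace(0, 1, 15) ** 2])
noise = np.array([[3.0, 0.2], [0.1, 3.0]])
tiles = np.vstack([track, noise])

# Agglomerative tree: each tile starts as a leaf; closest pairs merge first.
tree = linkage(tiles, method="single")

# Cut the tree at a maximum inter-cluster distance of 0.5.
labels = fcluster(tree, t=0.5, criterion="distance")
```

The two isolated tiles are still assigned to singleton clusters of their own, illustrating the drawback discussed above: every data point ends up in some cluster.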

Density based methods Density based clustering [23] is a variation on the hierarchical clustering approach. Instead of constructing a tree structure, density based clustering starts with a single object and iteratively adds objects to that cluster using a predefined criterion based on the density of objects within a given neighborhood radius, until the criterion is no longer met and all objects have been tested.

Like other hierarchical methods, density based methods allow an arbitrary number of clusters as well as arbitrary cluster shapes. They have the additional advantage of rejecting single isolated data points that are not closely related to a larger number of points. As a result, this method only produces clusters with multiple data points and can successfully exclude unrelated points from a cluster. For this reason, we have focused on density based methods in this work.
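The neighborhood-growing idea can be sketched with a minimal DBSCAN-style implementation in Python. The radius `eps`, the density threshold `min_pts`, and the toy chirp-plus-noise coordinates are illustrative assumptions for this sketch, not the actual parameters of the proposed algorithm:

```python
import numpy as np

def dbscan(points, eps, min_pts):
    """Minimal density-based clustering (DBSCAN-style): grow clusters from
    dense seed points; isolated points keep the noise label -1."""
    n = len(points)
    labels = np.full(n, -1)            # -1 marks unclustered / noise
    visited = np.zeros(n, dtype=bool)
    cluster = 0
    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        # Neighborhood: all points within radius eps of point i.
        nbrs = np.flatnonzero(np.linalg.norm(points - points[i], axis=1) <= eps)
        if len(nbrs) < min_pts:
            continue                   # not dense enough to seed a cluster
        labels[i] = cluster
        queue = list(nbrs)
        while queue:                   # iteratively expand the cluster
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster
            if not visited[j]:
                visited[j] = True
                jn = np.flatnonzero(
                    np.linalg.norm(points - points[j], axis=1) <= eps)
                if len(jn) >= min_pts:
                    queue.extend(jn)   # j is dense, so its neighbors join too
        cluster += 1
    return labels

# Toy chirp-like track plus two isolated tiles (illustrative coordinates).
track = np.column_stack([np.linspace(0, 1, 15), np.linspace(0, 1, 15) ** 2])
noise = np.array([[3.0, 0.2], [0.1, 3.0]])
tiles = np.vstack([track, noise])
labels = dbscan(tiles, eps=0.2, min_pts=3)
```

In contrast to a hierarchical cut, the two isolated tiles here retain the noise label -1 rather than forming singleton clusters, which is the property that motivates the density based choice.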


Rubab Khan 2015-06-02