How Representative Is a SPARQL Benchmark? An Analysis of RDF Triplestore Benchmarks

Triplestores are data management systems for storing and querying RDF data. Over recent years, various benchmarks have been proposed to assess the performance of triplestores across different performance measures. However, choosing the most suitable benchmark for evaluating triplestores in practical settings is not a trivial task, because triplestores experience varying workloads when deployed in real applications. We address the problem of determining an appropriate benchmark for a given real-life workload by providing a fine-grained comparative analysis of existing triplestore benchmarks. In particular, we analyze the data and queries provided with the existing triplestore benchmarks in addition to several real-world datasets. Furthermore, we measure the correlation between the query execution time and various SPARQL query features and rank those features based on their significance levels. Our experiments reveal several interesting insights about the design of such benchmarks. With this fine-grained evaluation, we aim to support the design and implementation of more diverse benchmarks. Application developers can use our results to analyze their data and queries and choose a data management system.


INTRODUCTION
The last years have witnessed a significant growth in the use of Linked Data and Semantic Web technologies. This growth has motivated the development of new triplestores with increasingly more efficient RDF storage and SPARQL query processing mechanisms.
Accordingly, various benchmarks [2,5,8,11,14,16,18,24,27,30,31,33] have been proposed to evaluate the querying performance of these triplestores. However, due to the heterogeneity of RDF datasets and SPARQL queries, real-world applications often require customized deployments and experience different workloads when deployed in real environments [22]. Thus, designing a one-size-fits-all benchmark or selecting the most suitable benchmark for a given use case and workload is not a straightforward task [12,21,24].
This work highlights key features of triplestore benchmarks pertaining to the three main components of benchmarks, i.e., datasets, queries, and performance metrics. State-of-the-art triplestore benchmarks are analyzed and compared against these features. Particularly, we consider triplestore benchmarks that rely on the native capabilities of the triplestores and do not require further reasoning over queries to get complete results. We also analyze the data and query logs of five real-world datasets selected from three different domains so as to provide a comparison with real-world datasets and queries. Our contributions are as follows.
(1) We identify key design features of SPARQL triplestore benchmarks based on our systematic survey of the state of the art.
(2) We provide a detailed comparative analysis of the queries and datasets of 11 representative triplestore querying benchmarks: Train Benchmark [30], FEASIBLE [24], WatDiv [2], DBpedia SPARQL Benchmark (DBPSB) [18], FishMark [5], Bowlogna [11], SP2Bench [27], Berlin SPARQL Benchmark (BSBM) [8], BioBench [33], LDBC Social Network Benchmark Business Intelligence workload (SNB-BI) [31], and LDBC SNB Interactive workload (SNB-INT) [14].
(3) We analyze real data and corresponding user queries of five real-world datasets (DBpedia, Semantic Web Dog Food (SWDF), NCBI Gene, SIDER, and DrugBank) and compare them with the selected triplestore benchmarks.
(4) We measure the impact of various SPARQL query features (e.g., result sizes, triple pattern selectivity, and number of join vertices) on the overall query execution time and rank these features according to their significance. In addition, we demonstrate that state-of-the-art triplestore benchmarks vary greatly, and highlight their current limitations.
(5) We perform extensive experiments to measure the impact of dataset structuredness (a well-known RDF dataset metric [12], formally defined in Section 2.1.1) on the overall query execution time as well as on the result sizes.
The rest of this paper is organized as follows: We provide an overview of the key RDF datasets and SPARQL query features that need to be considered while designing triplestore benchmarks based on a review of the state of the art. We then present a systematic survey of the current benchmarks for triplestores. The subsequent comparison of selected representative SPARQL benchmarks based on key data and query features identified in the survey is followed by a discussion of our results and some concluding remarks. The data and results presented in this evaluation are available online at https://github.com/dice-group/triplestore-benchmarks. The complete results can be reproduced and new benchmarks can be easily compared using the scripts provided at the project page.

BENCHMARK DESIGN FEATURES
In general, triplestore benchmarks comprise three main components: (1) a set of RDF datasets, (2) a set of SPARQL queries, and (3) a set of performance metrics. This section presents key features of each of these components that are important to consider in the development of triplestore benchmarks. Most of these features originate from state-of-the-art research contributions pertaining to triplestore benchmarks.

Datasets
Datasets used in triplestore benchmarks are either synthetic or selected from real-world RDF datasets [24]. Real-world RDF datasets are often regarded as useful for performing evaluations close to real-world settings [18]. Synthetic datasets are useful for testing the scalability of systems on datasets of varying sizes. Synthetic dataset generators produce datasets of varying sizes, which can often be tuned to reflect the characteristics of real-world datasets [12]. Previous works [12,20] highlighted two key measures for selecting such datasets for triplestore benchmarking: (1) dataset structuredness and (2) relationship specialty. However, observations from the literature (see, e.g., [12,23]) suggest that other features should also be considered, such as the number of triples, resources, properties, objects, and classes, the diversity in literal values, the average number of properties and instances per class, and the average indegrees and outdegrees as well as their distribution across resources.

Dataset Structuredness:
Duan et al. [12] combine many of the aforementioned dataset features into a single composite metric called dataset structuredness or coherence. This metric measures how well a dataset's classes (i.e., rdf:type) are covered by the different instances of the dataset. The structuredness value of any given dataset lies in [0, 1], where 0 stands for the lowest possible structure and 1 for the highest possible structure. They conclude that synthetic datasets are highly structured while real-world datasets have structuredness values ranging from low to high, covering the whole structuredness spectrum. Formally, dataset structuredness is defined in terms of class coverage. The coverage of a class C, denoted by CV(C), is defined as follows [12]:
Definition 2.1 (Class Coverage). Let D be a dataset. Moreover, let P(C) denote the set of distinct properties of class C, and I(C) denote the set of distinct instances of the class C. Let I(p, C) count the number of entities for which property p has its value set in the instances of C. Then, the coverage of the class CV(C) is
$$CV(C) = \frac{\sum_{p \in P(C)} I(p, C)}{|P(C)| \times |I(C)|}$$
In general, RDF datasets comprise multiple classes with a varying number of instances for different classes. The authors of [12] proposed a mechanism that considers the weighted sum of the coverage CV(C) of individual classes. For each class C, the weighted coverage is defined below.
Definition 2.2 (Weighted Class Coverage). The weighted coverage for a class C, denoted by WTCV(C), is calculated as
$$WTCV(C) = \frac{|P(C)| + |I(C)|}{\sum_{C' \in D} \left(|P(C')| + |I(C')|\right)}$$
Using Definitions 2.1 and 2.2, we are now ready to compute the structuredness of a dataset D.
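The class coverage and its weighting can be illustrated with a small Python sketch. It is not the implementation used in this evaluation: it assumes that triples are given as plain (subject, predicate, object) tuples, that the typing predicate is spelled "rdf:type", that rdf:type itself is not counted as a property, that a class without any property has coverage 0, and that the structuredness is obtained as the weighted sum of the class coverages as proposed in [12]. The function name structuredness is hypothetical.

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"  # assumed spelling of the typing predicate in the input


def structuredness(triples):
    """Return the structuredness of an iterable of (s, p, o) triples."""
    instances = defaultdict(set)    # class C -> I(C)
    properties = defaultdict(set)   # subject -> properties it has a value for
    for s, p, o in triples:
        if p == RDF_TYPE:
            instances[o].add(s)
        else:
            properties[s].add(p)

    coverage, weight = {}, {}
    for cls, insts in instances.items():
        # P(C): properties that are set on at least one instance of C
        props = set().union(*(properties.get(i, set()) for i in insts))
        # Sum over p of I(p, C), counted instance by instance
        filled = sum(len(properties.get(i, set())) for i in insts)
        # Assumption: a class without properties contributes coverage 0
        coverage[cls] = filled / (len(props) * len(insts)) if props else 0.0
        weight[cls] = len(props) + len(insts)

    total = sum(weight.values())
    if total == 0:
        return 0.0  # no typed instances at all
    # Weighted sum of the class coverages
    return sum(coverage[c] * weight[c] / total for c in coverage)


if __name__ == "__main__":
    toy = [
        ("a", RDF_TYPE, "Person"), ("a", "name", "Ann"), ("a", "age", "30"),
        ("b", RDF_TYPE, "Person"), ("b", "name", "Bob"),
    ]
    print(structuredness(toy))
```

In the toy dataset, the class Person has two properties and two instances, and three of the four possible property slots are filled, so the coverage (and, with a single class, the structuredness) is 0.75.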

Definition 2.3 (Dataset Structuredness). The overall structuredness or coherence of a dataset D, denoted by CH(D), is defined as
$$CH(D) = \sum_{C \in D} CV(C) \times WTCV(C)$$

Relationship Specialty:
In datasets, some attributes are more common and associated with many resources. In addition, some attributes are multi-valued, e.g., a person can have more than one cellphone number or professional skill. The number of occurrences of a predicate associated with each resource in the dataset provides useful information on the graph structure of an RDF dataset, and makes some resources distinguishable from others [20]. In real datasets, this kind of relationship specialty is commonplace. For example, several million people can like the same movie. Likewise, a research paper can be cited in several hundred other publications. Qiao et al. [20] suggest that synthetic datasets are limited in how they reflect this relationship specialty. This is due either to the simulation of uniform relationship patterns for all resources or to a random relationship generation process. The relationship specialty of a relationship predicate is defined as follows:
Definition 2.4 (Predicate Relationship Specialty). Let d be the distribution that records the number of occurrences of a relationship predicate r associated with each resource, and let µ be the mean and σ the standard deviation of d. The specialty value of r, denoted as κ(r), is defined as the Pearson's kurtosis value of the distribution d:
$$\kappa(r) = \frac{n(n+1)}{(n-1)(n-2)(n-3)} \cdot \frac{\sum_{x_i \in d} (x_i - \mu)^4}{\sigma^4} - \frac{3(n-1)^2}{(n-2)(n-3)}$$
where n is the number of available values, i.e., the sample size. The relationship specialty of a dataset is defined as a weighted sum of the specialty values of all relationship predicates:
Definition 2.5 (Dataset Relationship Specialty). The relationship specialty of a dataset D, denoted by RS(D), is calculated as follows:
$$RS(D) = \sum_{r_i \in R} \frac{|T(r_i)| \times \kappa(r_i)}{|T(D)|}$$
where R is the set of relationship predicates in D, |T(r_i)| is the number of triples in the dataset having predicate r_i, |T(D)| is the total number of triples in D, and κ(r_i) is the specialty value of relationship predicate r_i.
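A minimal Python sketch of the relationship specialty computation is given below. Several choices in it are assumptions of the example rather than statements about [20] or about this evaluation: occurrences are counted per subject only, the kurtosis is computed in its bias-corrected sample form (which needs at least four resources and a non-zero variance), predicates for which the kurtosis is undefined contribute nothing, and the weighted sum is normalized by the total number of triples. The function names kurtosis and relationship_specialty are hypothetical.

```python
from collections import Counter, defaultdict
from statistics import mean, stdev


def kurtosis(values):
    """Bias-corrected sample kurtosis; None if it is undefined."""
    n = len(values)
    if n < 4:
        return None
    mu, sigma = mean(values), stdev(values)  # stdev is the sample deviation
    if sigma == 0:
        return None
    fourth = sum((x - mu) ** 4 for x in values) / sigma ** 4
    scale = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return scale * fourth - correction


def relationship_specialty(triples):
    """Triple-count-weighted average of the per-predicate kurtosis values."""
    per_predicate = defaultdict(Counter)  # predicate -> subject -> occurrences
    total = 0
    for s, p, _ in triples:
        per_predicate[p][s] += 1
        total += 1
    score = 0.0
    for counts in per_predicate.values():
        k = kurtosis(list(counts.values()))
        if k is not None:  # assumption: undefined values are skipped
            score += sum(counts.values()) * k
    return score / total if total else 0.0


if __name__ == "__main__":
    # Four users like one movie each, a fifth user likes ten movies.
    likes = [(user, "likes", f"movie{j}")
             for user, c in zip("abcde", [1, 1, 1, 1, 10])
             for j in range(c)]
    print(relationship_specialty(likes))
```

The skewed toy distribution (1, 1, 1, 1, 10) yields a high kurtosis, which is the kind of non-uniform relationship pattern the specialty metric is meant to capture.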
The dataset structuredness and relationship specialty directly affect the result size, the number of intermediate results, and the selectivities of the triple patterns of the given SPARQL query. Therefore, they are important dataset design features to be considered during the generation of benchmarks [12,20,24].

SPARQL Queries
The literature on SPARQL queries [2,15,23,24,26] suggests that a SPARQL querying benchmark should vary its queries with respect to features such as the number of triple patterns, number of projection variables, result set sizes, query execution time, number of BGPs, number of join vertices, mean join vertex degree, mean triple pattern selectivities, BGP-restricted and join-restricted triple pattern selectivities, join vertex types, and highly used SPARQL clauses (e.g., LIMIT, OPTIONAL, ORDER BY, DISTINCT, UNION, FILTER, REGEX). All of these features have a direct impact on the runtime performance of triplestores. We assume that the reader is familiar with the basic concepts of SPARQL, including the notions of a triple pattern, a basic graph pattern (BGP), and projection variables (see https://www.w3.org/TR/sparql11-query/ for the corresponding definitions). In the following, we formally define the remaining SPARQL features, i.e., the number of join vertices, mean join vertex degree, join vertex types, triple pattern selectivities, and BGP-restricted and join-restricted triple pattern selectivities.
We represent any basic graph pattern (BGP) of a given SPARQL query as a directed hypergraph (DH) [25], a generalization of a directed graph in which a hyperedge can join any number of vertices. In our specific case, every hyperedge captures a triple pattern. The subject of the triple pattern becomes the source vertex of a hyperedge, and the predicate and object of the triple pattern become the target vertices. For instance, Figure 1 shows the hypergraph representation of a SPARQL query. Unlike a common SPARQL representation where the subject and object of the triple pattern are connected by an edge, our hypergraph-based representation contains nodes for all three components of the triple patterns. As a result, we can capture joins that involve predicates of triple patterns. Formally, our hypergraph representation is defined as follows:
Definition 2.6 (Directed Hypergraph of a BGP). A basic graph pattern B is represented as a directed hypergraph DH = (V, E) whose vertex set V contains all subjects, predicates, and objects of the triple patterns in B, and whose edge set E contains one hyperedge (S, T) for every triple pattern (s, p, o) in B, with source S = {s} and target T = {p, o}.
The representation of a complete SPARQL query as a DH is the union of the representations of the query's BGPs. Based on the DH representation of SPARQL queries, we can define the following features of SPARQL queries:

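Definition 2.7 (Join Vertex). For every vertex v ∈ V of a hypergraph DH = (V, E), let E_in(v) ⊆ E and E_out(v) ⊆ E denote the sets of incoming and outgoing edges of v, respectively. If |E_in(v)| + |E_out(v)| > 1, we call v a join vertex.
Definition 2.8 (Join Vertex Degree). The join vertex degree of a vertex v is JVD(v) = |E_in(v)| + |E_out(v)|, where E_in(v) resp. E_out(v) is the set of incoming resp. outgoing edges of v.
Definition 2.9 (Join Vertex Types). A vertex v ∈ V can be of type star, path, hybrid, or sink if this vertex participates in at least one join. A star vertex has more than one outgoing edge and no incoming edges. A path vertex has exactly one incoming and one outgoing edge. A hybrid vertex has either more than one incoming and at least one outgoing edge or more than one outgoing and at least one incoming edge. A sink vertex has more than one incoming edge and no outgoing edge. A vertex that does not participate in joins is simple.
Definition 2.10 (Triple Pattern Selectivity). Let tp_i be a triple pattern of a SPARQL query Q and D be a dataset. Furthermore, let N be the total number of triples in D and Card(tp_i, D) be the cardinality of tp_i w.r.t. D, i.e., the total number of triples in D that match tp_i. Then the selectivity of tp_i w.r.t. D, denoted by Sel(tp_i, D), is
$$\mathit{Sel}(tp_i, D) = \frac{\mathit{Card}(tp_i, D)}{N}$$
Definition 2.11 (BGP-Restricted Triple Pattern Selectivity). Consider a basic graph pattern BGP and a triple pattern tp_i belonging to BGP. Let R(tp_i, D) be the set of distinct solution mappings of executing tp_i over dataset D and R(BGP, D) be the set of distinct solution mappings of executing BGP over dataset D. Then the BGP-restricted triple pattern selectivity, denoted by Sel_BGP-Restricted(tp_i, D), is the fraction of distinct solution mappings in R(tp_i, D) that are compatible with a solution mapping in R(BGP, D) [2]. Formally, if Ω and Ω′ denote the sets underlying the (bag) query results R(tp_i, D) and R(BGP, D), respectively, then
$$\mathit{Sel}_{\text{BGP-Restricted}}(tp_i, D) = \frac{|\{\mu \in \Omega \mid \exists \mu' \in \Omega' : \mu \text{ and } \mu' \text{ are compatible}\}|}{|\Omega|}$$
Definition 2.12 (Join-Restricted Triple Pattern Selectivity). Consider a join vertex x in the DH representation of a BGP. Let BGP′ ⊆ BGP be the set of triple patterns that are incident to x. Furthermore, let tp_i ∈ BGP′ be a triple pattern, R(tp_i, D) be the set of distinct solution mappings of executing tp_i over dataset D, and R(BGP′, D) be the set of distinct solution mappings of executing BGP′ over dataset D. Then the x-restricted triple pattern selectivity, denoted by Sel_JVx-Restricted(tp_i, D), is the fraction of distinct solution mappings in R(tp_i, D) that are compatible with a solution mapping in R(BGP′, D) [2]. Formally, if Ω and Ω′ denote the sets underlying the (bag) query results R(tp_i, D) and R(BGP′, D), respectively, then
$$\mathit{Sel}_{\mathrm{JV}x\text{-Restricted}}(tp_i, D) = \frac{|\{\mu \in \Omega \mid \exists \mu' \in \Omega' : \mu \text{ and } \mu' \text{ are compatible}\}|}{|\Omega|}$$
All of the above query features were collected from previous works [2,15,23,24] on triplestore benchmarking.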
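The join vertex types of Definition 2.9 can be sketched in Python as follows. The sketch makes several assumptions of its own: a BGP is a list of (subject, predicate, object) string tuples, variables are recognizable by a leading "?", only variables are considered as potential join vertices, and a vertex counts as a join vertex when it has more than one incident hyperedge. The function name join_vertices is hypothetical.

```python
from collections import defaultdict


def join_vertices(bgp):
    """Map each join vertex of a BGP to a (degree, type) pair."""
    e_in, e_out = defaultdict(int), defaultdict(int)
    for s, p, o in bgp:
        e_out[s] += 1              # the subject is the source of the hyperedge
        for target in {p, o}:      # predicate and object are its targets
            e_in[target] += 1

    result = {}
    for v in set(e_in) | set(e_out):
        if not v.startswith("?"):  # assumption: only variables can join
            continue
        n_in, n_out = e_in.get(v, 0), e_out.get(v, 0)
        if n_in + n_out <= 1:
            continue               # simple vertex, takes part in no join
        if n_in == 0:
            kind = "star"          # several outgoing, no incoming edges
        elif n_out == 0:
            kind = "sink"          # several incoming, no outgoing edges
        elif n_in == 1 and n_out == 1:
            kind = "path"
        else:
            kind = "hybrid"
        result[v] = (n_in + n_out, kind)
    return result


if __name__ == "__main__":
    bgp = [
        ("?s", ":name", "?n"),
        ("?s", ":knows", "?o"),
        ("?o", ":name", "?m"),
    ]
    print(join_vertices(bgp))
```

For the three triple patterns in the usage example, ?s has two outgoing edges and is a star vertex, while ?o has one incoming and one outgoing edge and is a path vertex; ?n and ?m are simple vertices. The number of join vertices and the mean join vertex degree of a query follow directly from the returned mapping.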
Finally, we combine all these important query features into a single composite metric called the Diversity Score of the benchmark queries, defined as follows.
Definition 2.13 (Queries Diversity Score). Let µ_i be the mean and σ_i the standard deviation of the distribution of the i-th query feature over the queries of a benchmark B. The overall diversity score DS of the queries is the average coefficient of variation of all k query features analyzed in the queries of benchmark B:
$$DS = \frac{1}{k} \sum_{i=1}^{k} \frac{\sigma_i(B)}{\mu_i(B)}$$
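The diversity score is straightforward to compute once the per-query feature values are available. The following Python sketch assumes that the feature values are passed as a dictionary from feature name to a list of per-query values, uses the population standard deviation, and treats a feature with zero mean as having zero variation; these are assumptions of the example, and the function name diversity_score is hypothetical.

```python
from statistics import mean, pstdev


def diversity_score(features):
    """Average coefficient of variation over all query features.

    features maps a feature name to the list of its per-query values.
    """
    variations = []
    for values in features.values():
        mu = mean(values)
        # Assumption: a feature with zero mean contributes no variation
        variations.append(pstdev(values) / mu if mu else 0.0)
    return sum(variations) / len(variations)


if __name__ == "__main__":
    toy = {
        "triple_patterns": [1, 3],        # coefficient of variation 0.5
        "projection_variables": [2, 2],   # coefficient of variation 0.0
    }
    print(diversity_score(toy))
```

Because the coefficient of variation is unitless, features measured on very different scales (e.g., result sizes and number of BGPs) can be averaged into one score.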

Performance Metrics
Based on previous triplestore benchmarks and performance evaluations [2,5,8,10,11,14,16,18,24,27,30,31,33], the performance metrics for such comparisons can be categorized as follows:
• Query Processing Related: The performance metrics in this category are related to the query processing capabilities of the triplestores. The query execution time is the central performance metric in this category. However, reporting the execution time for individual queries might not be feasible due to the large number of queries in the given benchmark. To this end, Query Mix per Hour (QMpH) and Queries per Second (QpS) are regarded as central performance measures to test the querying capabilities of the triplestores [8,18,24]. In addition, the query processing overhead in terms of CPU and memory usage is important to measure during query execution [27]. This also includes the number of intermediate results, the number of disk/memory swaps, etc.
• Data Storage Related: Triplestores need to load the given RDF data and mostly create indexes before they are ready for query execution. In this regard, the data loading time, the storage space required, and the index size are important performance metrics in this category [8,11,27,33].
• Result Set Related: Two systems can only be compared if they produce exactly the same results. Therefore, result set correctness and completeness are important metrics to be considered in triplestore evaluations [8,24,27,33].
• Parallelism with/without Updates: Some of the aforementioned triplestore performance evaluations [8,10,33] also measured the parallel query processing capabilities of the triplestores by simulating workloads from multiple querying agents with and without dataset updates.
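The two throughput metrics can be derived from measured runtimes as in the Python sketch below. The exact definitions differ slightly between benchmarks; the sketch assumes that QpS is the number of executed queries divided by their total runtime in seconds and that QMpH is the number of completed query mixes scaled to one hour. Both function names are hypothetical.

```python
def queries_per_second(runtimes_s):
    """Number of executed queries divided by their total runtime in seconds."""
    return len(runtimes_s) / sum(runtimes_s)


def query_mixes_per_hour(mix_runtimes_s):
    """Number of completed query mixes, scaled to one hour.

    mix_runtimes_s holds the total runtime in seconds of each executed mix.
    """
    return 3600.0 * len(mix_runtimes_s) / sum(mix_runtimes_s)


if __name__ == "__main__":
    print(queries_per_second([0.5, 0.5, 1.0]))   # three queries in two seconds
    print(query_mixes_per_hour([60.0, 120.0]))   # two mixes in three minutes
```

Both metrics aggregate over many queries, which is why the result set checks in the third category matter: a system that returns incomplete results quickly would otherwise look faster than it is.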
We analyzed state-of-the-art SPARQL triplestore benchmarks across all of the above-mentioned dataset and query features as well as the performance metrics. The results are presented in Section 4.

SYSTEMATIC SURVEY
In this section, we present a systematic survey carried out to collect triplestore benchmarks and their selection criteria for further analysis. We conducted a public survey through the W3C Linked Open Data and Semantic Web mailing lists by sending an email with a request to participate. We received 14 responses regarding SPARQL triplestore benchmarks. Moreover, we used Google Scholar to retrieve published research work relating to the design of triplestore benchmarks and/or their performance evaluation. Initially, we selected 40 relevant papers and evaluated them against our inclusion criteria. These criteria mandated that (1) the benchmark targets the query runtime performance evaluation of triplestores, (2) both the RDF data and the SPARQL queries of the benchmark are publicly available or can be generated, and (3) the queries do not require reasoning to retrieve the complete results.
After manual evaluation, we found 11 benchmarks (7 with synthetic and 4 with real data) that fulfilled our requirements. The sections below provide details of the selected benchmarks.

Synthetic Triplestore Benchmarks
The Train Benchmark (TrainBench) [30] uses a data generator that produces railway networks in increasing sizes and serializes them in different formats, including RDF. The Waterloo SPARQL Diversity Test Suite (WatDiv) [2] provides a synthetic data generator that produces RDF data with a tunable structuredness value, as well as a query generator. The queries are generated from different query templates. SP2Bench [27] mirrors vital characteristics (such as power law distributions or Gaussian curves) of the data in the DBLP bibliographic database. The Berlin SPARQL Benchmark (BSBM) [8] uses query templates to generate any number of SPARQL queries for benchmarking, covering multiple use cases such as explore, update, and business intelligence. Bowlogna [11] models a real-world setting derived from the Bologna process and offers mostly analytic queries reflecting data-intensive user needs. The LDBC Social Network Benchmark (SNB) defines two workloads: (1) the Interactive workload (SNB-INT) measures the evaluation of graph patterns in a localized scope (e.g., in the neighborhood of a person), with the graph being continuously updated [14], and (2) the Business Intelligence workload (SNB-BI) measures the evaluation of graph pattern matching with aggregations, touching on a significant portion of the graph [31], without any updates. Note that these two workloads are regarded as two separate triplestore benchmarks based on the same dataset.

Triplestore Benchmarks Using Real Data
FEASIBLE [24] is a cluster-based SPARQL benchmark generator, which is able to synthesize customizable benchmarks from the query logs of SPARQL endpoints. The DBpedia SPARQL Benchmark (DBPSB) [18] is another cluster-based approach that generates benchmark queries from DBpedia query logs, but employs different clustering techniques than FEASIBLE. The FishMark [5] dataset is obtained from FishBase and provided in both RDF and SQL versions. Its SPARQL queries were obtained from the logs of the web-based FishBase application. BioBench [33] evaluates the performance of RDF triplestores with real biological data and queries from five different real-world RDF datasets, i.e., Cell, Allie, PDBJ, DDBJ, and UniProt. Due to the size of the datasets, we were only able to analyze the combined data and queries of the first three.

Selected Real-World Datasets
As mentioned before, we aimed to analyze the data and queries of real-world datasets and compare them to those of the benchmark datasets and queries. The selection criteria for the real-world datasets were: (1) the RDF datasets must be publicly available, and (2) the real queries posted by users of the datasets via SPARQL endpoints must be available. We were able to obtain real query logs of the Bio2RDF datasets, DBpedia, and Semantic Web Dog Food.
Our goal was to select real-world datasets from different domains. Hence, we selected DBpedia and SWDF as well as three datasets from Bio2RDF: NCBI Gene, SIDER, and DrugBank. The selection of the three Bio2RDF datasets was based on a recommendation from domain experts. The well-known DBpedia dataset is the RDF version of Wikipedia. SWDF represents Semantic Web and Linked Data publications as RDF. NCBI Gene provides genetic information from a wide range of species. SIDER contains information on marketed medicines and their recorded side effects. The DrugBank knowledge base contains information about drugs, their composition, and their interactions. Table 1 shows statistics of the selected benchmark datasets and real-world datasets. More advanced statistics are presented in the next section. The table also shows the number of SPARQL queries included in the corresponding benchmark or query log. It is important to mention that we only selected SPARQL SELECT queries for analysis. This is because we wanted to analyze the triplestore benchmarks for their query runtime performance, and most of these benchmarks only contain SELECT queries [24]. For the synthetic benchmarks that include data generators, we chose the datasets used in the evaluation of the original paper that were comparable in size to the datasets of other synthetic benchmarks. For template-based query generators such as WatDiv, DBPSB, and SNB, we chose one query per available template. For FEASIBLE, we generated a benchmark of 50 queries from the DBpedia log to be comparable with the well-known WatDiv benchmark, which includes 20 basic testing query templates and 30 extensions for testing.

ANALYSIS OF THE BENCHMARKS
We present a detailed analysis of the datasets, queries, and performance metrics of the selected benchmarks and datasets according to the design features presented in Section 2.

Datasets
We present results pertaining to the dataset features of Section 2.1. Figure 2a shows the structuredness values of the selected benchmarks and real-world datasets. Duan et al. [12] establish that synthetic benchmarks are highly structured while real-world datasets are less structured. This important dataset feature is well covered in recent synthetic benchmarks such as TrainBench (with a structuredness value of 0.23) and WatDiv, which lets the user generate a benchmark dataset of a desired structuredness value. However, Bowlogna (0.99), BSBM (0.94), and SNB (0.86) have relatively high structuredness values. The average structuredness value of the selected five real-world datasets is 0.49, and it is 0.65 for the 13 real-world datasets used in LargeRDFBench [23]. Finally, on average, synthetic benchmarks are still more structured than real data benchmarks (0.61 vs. 0.45).

Relationship Specialty.
According to [20], relationship specialty in synthetic datasets is limited, i.e., the overall relationship specialty values of synthetic datasets are lower than those of similar real-world datasets. The dataset relationship specialty results are presented in Figure 2b.
Figure 2: Analysis of the datasets of triplestore benchmarks and real-world data.
An important issue is the correlation between the structuredness and the relationship specialty of the datasets. To this end, we computed Spearman's correlation between the structuredness and specialty values of all the selected benchmarks and real-world datasets. The correlation of the two measures is −0.5, indicating a moderate inverse relationship. This means that the higher the structuredness, the lower the specialty value. This is because in highly structured datasets, data is generated according to a specific distribution without treating some predicates differently (in terms of occurrences) from others.
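For readers who want to repeat this kind of analysis on their own datasets, Spearman's correlation can be computed without external libraries as in the following Python sketch. It ranks both value lists (ties receive their average rank) and returns the Pearson correlation of the ranks. The input values in the usage example are made up for illustration and are not the values measured in this evaluation; the function names are hypothetical.

```python
from statistics import mean


def ranks(values):
    """1-based ranks of the values; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(xs), ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


if __name__ == "__main__":
    # Hypothetical structuredness and specialty values of four datasets.
    structuredness_values = [0.2, 0.5, 0.9, 0.95]
    specialty_values = [900.0, 40.0, 120.0, 3.0]
    print(spearman(structuredness_values, specialty_values))
```

A rank correlation is a reasonable choice here because specialty values can span several orders of magnitude, and only the ordering of the datasets enters the coefficient.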

Queries
This section presents results pertaining to the query features discussed in Section 2.2. Figure 3 shows the box plot distributions of real-world datasets and benchmark queries across the query features defined in Section 2. The values inside the brackets, e.g., the 0.89 in "BioBench (0.89)", show the diversity score (Definition 2.13) of the benchmark or real-world dataset for the given query feature.
Starting from the number of projection variables (ref. Figure 3a), the NCBI Gene dataset has the lowest diversity score (0.16) and SP2Bench has the highest score of 1.14. The mean diversity score (across all benchmarks and real-world datasets) for this feature is 0.59 and hence the diversity scores of DBPSB, SNB-BI, SNB-INT, WatDiv, and Bowlogna are below the average value. Even though the diversity score of BSBM is above average, the distribution shows that the values mostly lie in the second quartile of the box plot. The average diversity score of the number of join vertices (ref. Figure 3b) is 1.39 and hence the diversity scores of the Bowlogna, FishMark, WatDiv, BSBM, TrainBench, BioBench, DBPSB, SNB-BI, and SNB-INT benchmarks are below the average value. It is important to mention that the highest number of join vertices recorded in a query is 51 in the SNB-BI benchmark. The average diversity score of the number of triple patterns (ref. Figure 3c) is 0.75 and hence the diversity scores of the FishMark, Bowlogna, BSBM, and WatDiv benchmarks are below the average value. The average diversity score of the result sizes (ref. Figure 3d) is 11.89 and hence the diversity scores of all benchmarks are below the average value. The average diversity score of the join vertex degree (ref. Figure 3e) is 1.08 and hence the diversity scores of all benchmarks except FEASIBLE are below the average value. The average diversity score of the triple pattern selectivity (ref. Figure 3f) is 3.17 and hence the diversity scores of all benchmarks except FEASIBLE are below the average value. The average diversity score of the join-restricted triple pattern selectivity (ref. Figure 3g) is 1.39 and hence the diversity scores of all benchmarks except FEASIBLE and BSBM are below the average value. The average diversity score of the BGP-restricted triple pattern selectivity (ref. Figure 3h) is 4.11 and hence the diversity scores of all benchmarks except WatDiv are below the average value.
The average diversity score of the number of BGPs (ref. Figure 3i) is 0.63 and hence the diversity scores of SNB-BI, SNB-INT, BSBM, SP2Bench, TrainBench, WatDiv, and Bowlogna are below the average value.
The Linked SPARQL Queries (LSQ) [22] representation stores additional SPARQL features, such as the use of DISTINCT, REGEX, BIND, VALUES, HAVING, GROUP BY, OFFSET, aggregate functions, SERVICE, OPTIONAL, UNION, property paths, etc. We count all of these SPARQL operators and functions in a query and use the count as a single query dimension, the number of LSQ features. The average diversity score of the number of LSQ features (ref. Figure 3j) is 0.30, and hence only the diversity scores of SNB and WatDiv are below the average value. Finally, the average diversity score of the query runtimes is 12.87 (ref. Figure 3k), and hence the diversity scores of all benchmarks are below the average value.
In summary, FEASIBLE's diversity scores are below the average values in 3 of the 11 features, followed by BioBench with 7/11. These are followed by SP2Bench, TrainBench, and BSBM with 8/11 each. The next is FishMark 9/11 and then Bowlogna, WatDiv, SNB-BI, SNB-INT, and DBPSB with 10/11 each. Figure 3l shows the overall (across all the features, ref. Definition 2.13) diversity scores of the benchmarks and real-world datasets. In the benchmarks category, FEASIBLE produces the most diverse benchmarks (diversity score 2.  Table 2 shows the percentage coverage of widely used [22] SPARQL clauses and join vertex types for each benchmark and real-world dataset. We highlighted cells for benchmarks that either completely miss or overuse certain SPARQL clauses and join vertex types. TrainBench and WatDiv queries mostly miss the important SPARQL clauses. All of FishMark's queries contain at least one "Star" join node. The distribution of other SPARQL clauses, such as subquery, BIND, aggregates, solution modifiers, property paths, and

Performance Metrics
This section presents results pertaining to the performance metrics discussed in Section 2.3. Table 3 shows the performance metrics used by the selected benchmarks to compare triplestores. The query runtime over the complete set of benchmark queries is the central performance metric and is used by all of the selected benchmarks. In addition, QpS and QMpH are commonly used in the query processing category. We found that, in general, the processing overhead generated by query executions is not paid much attention, as only SP2Bench measures this metric. In the "storage" category, the time taken to load the RDF graph into the triplestore is the most common metric. The in-memory/HDD space required to store the dataset and the corresponding indexes did not get much attention. Result set correctness and completeness are important metrics to consider when there is a large number of queries in the benchmark and composite metrics such as QpS and QMpH are used. We can see that many of the benchmarks do not explicitly check these two metrics. They mostly assume that the results are complete and correct. However, this might not always be the case [23]. Additionally, only BSBM considers the evaluation of triplestores with simultaneous user requests with updates. However, benchmark execution frameworks such as IGUANA [10] can be used to measure the parallel query processing

Impact of Dataset Structuredness
The dataset structuredness has been regarded as one of the most important RDF dataset features [12]. However, to the best of our knowledge, the impact of dataset structuredness on query runtimes and result set sizes has not been measured in the literature. To measure this impact, we need to create synthetic datasets of varying structuredness values that all follow the same dataset description model or schema and are comparable to each other in size. We then need to execute the same set of benchmark queries over the generated datasets and measure the result set sizes and query runtimes. The WatDiv benchmark data generator allows control over the size of the generated datasets as well as the structuredness values of individual entities. However, generating datasets with exact sizes and structuredness values is difficult because of the scaling and structuredness factors used to control the size and structuredness of the overall dataset. We generated 10 datasets with the varying structuredness values given in Table 4. We ran the complete set of WatDiv queries on the individual datasets using the Virtuoso triplestore and measured the result set sizes and runtimes of individual queries. Figure 4 shows the impact of the dataset structuredness values on query runtimes and result set sizes. We can clearly see a positive correlation between the dataset structuredness values and both the query runtimes and the result sizes, i.e., the higher the structuredness value of the dataset, the higher the result sizes and query runtimes. The results also suggest that there is a direct correlation between the result sizes and the query runtimes.

Correlation of Query Features vs. Runtimes
In previous sections, we presented the results of some important SPARQL query features that should be considered while designing SPARQL benchmarks. These features were mostly taken from previous works [2,18,24] in SPARQL benchmarking. However, an  Table 5. Note that this table presents combined results obtained from the Virtuoso and FUSEKI triplestores. We chose two triplestores because the query planner of a triplestore can greatly affect the query runtimes and hence bias the results toward a particular query planner. The overall results show that the number of projection variables (correlation 0.32) has the highest impact and the BGP-restricted triple pattern selectivity (correlation 0.00) has the lowest impact on query runtimes. Yet, the results suggest that there is no single query feature that has a strong or very strong correlation with the query runtimes. This further suggests that the overall query runtime is impacted by a combination of different features.

RELATED WORK
Graph structure vs. query performance. The correlation between query runtime and workload metrics was studied in [17]. In addition to standard metrics, the authors introduced composite metrics such as the absolute difficulty (the logarithm of the search space size) and the relative difficulty, which expresses how much worse a query engine performs than the theoretical lower bound required by a certain query. The authors generated 12 graphs with different degree distributions along with 25+ queries of different shapes and measured their execution times. Then, they calculated the Kendall's τ rank correlation coefficient for p < 0.001 between each metric and the query execution times. The strongest correlation (+0.38) was exhibited by the absolute difficulty metric. The goal of gMark [4] is to define a schema-driven workload generator that synthesizes graph instances and queries for a given schema. The approach relies on controlling the diversity of the generated graphs and the difficulty of the generated workloads, using a selectivity estimation algorithm, which guarantees the selectivity of (certain) generated queries. The flexibility of the approach is demonstrated by generating workloads based on existing RDF benchmarks (SNB, SP2Bench, WatDiv).
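For readers unfamiliar with Kendall's τ, the following minimal sketch shows how the coefficient can be computed by counting concordant and discordant pairs. It implements the tie-corrected τ-b variant with a quadratic pairwise loop; whether [17] used this variant is an assumption, and the significance test (the p-value) is not covered here.

```python
from itertools import combinations
from math import sqrt

def kendall_tau_b(xs, ys):
    """Kendall's tau-b between two equal-length sequences (O(n^2) sketch)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0:
            ties_x += 1          # pair tied on the metric
        if dy == 0:
            ties_y += 1          # pair tied on the runtime
        if dx * dy > 0:
            concordant += 1      # both orderings agree
        elif dx * dy < 0:
            discordant += 1      # orderings disagree
    n_pairs = len(xs) * (len(xs) - 1) // 2
    denom = sqrt((n_pairs - ties_x) * (n_pairs - ties_y))
    if denom == 0:
        raise ValueError("tau-b is undefined when one input is constant")
    return (concordant - discordant) / denom

# Hypothetical metric values and runtimes for four queries (illustration only).
metric = [1, 2, 3, 4]
runtime = [1, 3, 2, 4]
tau = kendall_tau_b(metric, runtime)  # 5 concordant, 1 discordant out of 6 pairs
```

Because τ depends only on the relative ordering of the measurements, it is insensitive to the scale of the runtimes, which makes it a natural fit for comparing metrics whose units differ.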
Characterization of typed graphs. While this paper focuses on determining the correlation between query execution time and graph metrics such as relationship specialty and structuredness (Sec. 2.1.1), other metrics could also provide valuable insight. In particular, the tools of network science are often used to uncover the structural interplay between the nodes of a certain graph [9], which in turn could be used to predict the difficulty of querying such graphs. However, most works in the field only target the characterization of homogeneous, untyped networks, which omits a great deal of valuable information when trying to understand the structure of typed networks such as RDF graphs. Only recent research has targeted the understanding of typed graphs, referred to as "multilayer", "multiplex", or "multidimensional" networks. The authors of [7] generalized the degree distribution to take edge types into account and introduced a set of typed connectivity metrics. Extending this work, the authors of [19] defined a set of additional metrics to characterize the effect of types on nodes and edge pairs. Meanwhile, paper [6] introduced variants of the local clustering coefficient that describe the ratio of typed triangles in the network. Survey [1] summarizes the state of the art on analyzing multilayer (edge-typed) networks. Typed graph metrics were used successfully in the context of model-driven engineering to describe the structure of system models and distinguish real graph models from synthetic ones [29,32]. However, these were limited to graphs containing at most 1M nodes.
In the fields of semantic web and database engineering, a set of simple typed graph metrics was proposed in the context of social network analysis in [13]. The concept of "meta-path", a path with a given sequence of node/edge types, and its related literature were investigated in survey [28]. Duan et al. [12] presented the structuredness values of several real-world datasets and synthetic benchmarks. RBench [20] introduced the relationship specialty metric and compared real datasets with datasets synthesized by data generators. WatDiv [2] introduced the join-restricted and BGP-restricted triple pattern selectivities as important query features.
Our work. In this paper, we conducted a systematic survey to collect a current list of triplestore benchmarks. We collected important query and dataset features from the state of the art [2,12,20,24] and added further important query features such as the number of projection variables, the number of BGPs, the number of LSQ features, etc. We compared 11 triplestore benchmarks and 5 real-world datasets across the identified data and query features. We also measured the correlation of the identified query features with the overall query runtime. To the best of our knowledge, no such detailed analysis of triplestore benchmarks exists.

CONCLUSION AND FUTURE WORK
We performed a comprehensive analysis of existing benchmarks by studying synthetic and real-world datasets as well as by employing SPARQL queries with multiple variations. Our evaluation results suggest the following: (1) The dataset structuredness problem is well covered in recent synthetic data generators (e.g., WatDiv, TrainBench). The low relationship specialty problem in synthetic datasets still exists in general and needs to be covered in future synthetic benchmark generation approaches; (2) The FEASIBLE framework employed on DBpedia generated the most diverse benchmark in our evaluation; (3) The SPARQL query features we selected have a weak correlation with query execution time, suggesting that the query runtime is a complex measure affected by multidimensional SPARQL query features. Still, the number of projection variables, join vertices, triple patterns, the result sizes, and the join vertex degree are the top five SPARQL features that most impact the overall query execution time; (4) Synthetic benchmarks often fail to contain important SPARQL clauses such as DISTINCT, FILTER, OPTIONAL, LIMIT and UNION; (5) The dataset structuredness has a direct correlation with the result sizes and execution times of queries and an indirect correlation with dataset specialty. As future work, we plan to broaden the scope of our analysis by adding more types of SPARQL query benchmarks (e.g., using reasoning, querying streams) and investigating more typed graph metrics on benchmark datasets.