When starting a new project that will eventually contain time series data (user activity, system logs, etc.), you need to decide where to store that information. Elasticsearch, often as part of the ELK Stack (Elasticsearch, Logstash, Kibana), is a popular solution for both storing and visualizing such data. However, Elasticsearch’s primary function is document indexing and search, which raises doubts about how well it handles time series data. To address these doubts, we’re going to compare Elasticsearch with another database popularly used for this purpose: InfluxDB.
What Is Time Series Data?
Time series data can be defined as data points indexed by their temporal order, where the distance between two consecutive data points may or may not be equal. If data points are taken at a constant frequency (e.g., sampling the data every 10 ms), the series is evenly spaced and is called a regular, or discrete-time, data series.
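To make the distinction concrete, here is a minimal Python sketch that checks whether a set of timestamps is evenly spaced. The function name `is_evenly_spaced` and the sample data are illustrative assumptions, not part of any library.

```python
from datetime import datetime, timedelta


def is_evenly_spaced(timestamps):
    """Return True if consecutive timestamps are all separated by the same interval."""
    ordered = sorted(timestamps)
    # Collect the distinct gaps between neighbouring points.
    gaps = {later - earlier for earlier, later in zip(ordered, ordered[1:])}
    # Zero or one distinct gap means the spacing is constant.
    return len(gaps) <= 1


start = datetime(2024, 1, 1)

# Sampled every 10 ms: constant spacing.
regular = [start + timedelta(milliseconds=10 * i) for i in range(5)]

# Event-driven points (e.g., user actions): spacing varies.
irregular = [
    start,
    start + timedelta(milliseconds=10),
    start + timedelta(milliseconds=35),
]

print(is_evenly_spaced(regular))
print(is_evenly_spaced(irregular))
```

Sensor readings taken on a fixed schedule typically look like `regular`, while user-driven events such as posts or likes look like `irregular`.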
In computer systems, all user data can potentially be represented as time series data, as all stored information has a time component that can provide different metrics in different scenarios. For example, Twitter, Facebook, and LinkedIn have data on each user’s registration date, as well as the dates and times at which various actions were performed (tweet or article posted, activity liked, etc.).
The data’s time component is important in such scenarios, but in other use cases it is crucial, because the metrics calculated from the data are based largely on time intervals. Relevant examples include the tracking of user activity by Google Analytics and Netflix, and metrics that track the behavior of running systems (e.g., JMX, operating system, or network statistics).
Data arriving at a higher frequency creates challenges of its own: the system has to handle more write requests per second and store everything it receives. One sensor sampling 30 times per second with a 1 KB payload generates about 2.6 GB of data each day (30 × 1 KB × 86,400 seconds), so 100 such sensors would produce roughly 260 GB per day. Querying and aggregating that much data to extract useful information is another issue to consider. Deciding on the right storage engine for time series data is therefore one of the first challenges to overcome when designing a system that generates temporal data.
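A back-of-the-envelope estimate like the one above is easy to reproduce. The sketch below is only illustrative: the function name `daily_volume_bytes` is hypothetical, and it assumes decimal units (1 KB = 1,000 bytes) and a sensor that writes continuously for all 86,400 seconds of the day.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

# Decimal units are an assumption; binary units (1 KiB = 1,024 bytes) shift the totals slightly.
KB = 1_000
MB = 1_000_000
GB = 1_000_000_000


def daily_volume_bytes(samples_per_second, payload_bytes, sensors=1):
    """Raw payload bytes written per day, ignoring indexing and storage overhead."""
    return samples_per_second * payload_bytes * SECONDS_PER_DAY * sensors


one_per_second = daily_volume_bytes(1, 1 * KB)
thirty_per_second = daily_volume_bytes(30, 1 * KB)
hundred_sensors = daily_volume_bytes(30, 1 * KB, sensors=100)

print(f"1 sample/s, 1 sensor:      {one_per_second / MB:.1f} MB/day")
print(f"30 samples/s, 1 sensor:    {thirty_per_second / GB:.2f} GB/day")
print(f"30 samples/s, 100 sensors: {hundred_sensors / GB:.1f} GB/day")
```

At one 1 KB sample per second, a single sensor writes about 86.4 MB per day. At 30 samples per second that becomes about 2.59 GB per day, and 100 such sensors come to about 259 GB per day before any index or replication overhead.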