Various aspects of working with time series data — Part 1: Time formats
This article is part of a series discussing issues that everyone working with Time Series data (especially data scientists) should know about.
Introduction
Time Series (TS) is one of the most common data types. It is a sequence of data points indexed in time order. These data points typically consist of successive measurements made from the same source over a fixed time interval and are used to track change over time. Some examples of time series data are: sensor data, stock prices, heart rate, weather conditions, precipitation, retail sales, and more. TS data differs from other data types in a few ways. The main difference is that the data points are not independent: they relate to each other, and their order is important. In this sense, TS data is similar to text. Another difference is that in TS data we are dealing with time. Time has its own rules, and it does not behave the same in different places. There are many limitations and parameters to consider when dealing with TS data, and I hope to cover most of them in this series.
The topics discussed in this series vary in depth and complexity. If you are just starting your journey with TS data, you might take interest in the whole series, and if you are already familiar with TS data, you might want to skip to the later parts, which cover more advanced topics.
All of the code examples and discussions are in Python.
Articles in this series
Part 1: Time formats
Part 2: Time series analysis — Seasonality, Trends, and Frequencies
Part 3: Anomalies, Motifs, and Signatures
Part 4: Time series tasks and relevant algorithms
Table of contents
· Introduction to time formats
· Converting between formats and watching for timezones
· Timezone awareness
· Type conversions
· Working with time
· Pros and Cons of the different formats
· Summary
Introduction to time formats
Time can be displayed in a few different formats. The three most common formats are Datetime, Time Tuple, and Unix (or POSIX) Timestamp.
The Datetime format shows the date and the time as we are used to seeing them in everyday life (see example below), and it may include some or all of the following: year, month, and day (from the Gregorian calendar), hour, minute, second, and fractions of a second.
The Time Tuple format shows time as integers, separated by commas, inside a tuple, and it can have the same date and time objects as the Datetime format (see example below).
Unix Timestamp, on the other hand, looks completely different (see example below). It is an integer (currently ten digits long), and while converting it in our head to a familiar date format is not easy, understanding the meaning of the number actually is: the Unix Timestamp simply counts the number of seconds that have passed since 01/01/1970 at 00:00 UTC (known as the Unix Epoch).
Why should we use this format? That will become clearer when we discuss timezones and format conversions.
# An example of the Datetime format:
Timestamp('2023-06-25 18:10:53')
# An example of the Time Tuple format:
(2023, 6, 25, 18, 10, 53)
# An example of the Unix Timestamp format:
1687716653
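If you want to see how these formats come up in practice, here is a minimal sketch (standard library only) that produces all three for the current moment:
# Producing all three formats for the current moment:
from datetime import datetime
now = datetime.now()     # Datetime format
now.timetuple()[:6]      # Time Tuple format: (year, month, day, hour, minute, second)
int(now.timestamp())     # Unix Timestamp format: seconds since the Unix Epoch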
Converting between formats and watching for timezones
Converting from Unix Timestamp to Datetime can be done in a few ways. Here are two common ones:
# The first way - Using the pandas library (pd.to_datetime)
import pandas as pd
pd.to_datetime(1687716653, unit='s')
>> Timestamp('2023-06-25 18:10:53')
# unit = 's' tells the method to treat the integer as a counter of seconds.
# The second way - Using the datetime library
from datetime import datetime
datetime.fromtimestamp(1687716653)
>> datetime.datetime(2023, 6, 25, 21, 10, 53)
As you can see, the output formats of pandas and datetime differ, but what is more intriguing is that they return two completely different times for the same Unix Timestamp. Why does this happen? We will discuss it in a bit.
Before that, let’s see some examples of converting back to Unix Timestamp:
dt = pd.to_datetime(1687716653, unit='s')
dt.timestamp()
>> 1687716653.0
tt = datetime.fromtimestamp(1687716653)
tt.timestamp()
>> 1687716653.0
*Both pandas' Timestamp.timestamp() and datetime's datetime.timestamp() return the same result here.
**The Unix Timestamp can also appear as a float, with a zero after the decimal point, as shown above.
Once again, we can see that I entered two different times and got the same Unix Timestamp. So let's explain why this happens and what we should do to prevent it.
One of the most confusing yet critical factors to watch for when dealing with TS data is timezones. It may sound rather trivial, but it is actually very tricky. So what happens in the example above? While the pandas method (to_datetime) converts the timestamp to UTC, the datetime method (fromtimestamp) converts it to the local timezone (as set on the computer). The latter is problematic if it goes unnoticed and can cause us problems. Since the Unix Timestamp relates to UTC, it might be easier to convert to the Datetime format and see the UTC time instead of the local time.
***UTC is Coordinated Universal Time, and all the timezones around the world are defined relative to it. If you want to read more: UTC.
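As a side note, if we do want the datetime library to show the UTC time directly, we can pass the timezone explicitly. A small sketch:
# Asking the datetime library for UTC explicitly, instead of the local timezone:
from datetime import timezone
datetime.fromtimestamp(1687716653, tz=timezone.utc)
>> datetime.datetime(2023, 6, 25, 18, 10, 53, tzinfo=datetime.timezone.utc)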
Timezone awareness
But wait, why does .timestamp() convert the two different times into the same Unix Timestamp? To answer that, we have to meet the subject of timezone awareness. Basically, every Datetime object can be in one of two states: timezone-aware or timezone-naive. Timezone-naive means that it does not relate to any specific timezone, while timezone-aware means that we have assigned it a certain timezone. How can we use this? Let's look at the following example:
# Pandas has a "shortcut" to create UTC awareness:
pd.to_datetime(1687716653, unit='s', utc=True)
>> Timestamp('2023-06-25 18:10:53+0000', tz='UTC')
# And you can also make it aware of other timezones:
london_aware = pd.to_datetime(1687716653, unit='s').tz_localize(tz='Europe/London')
print(london_aware)
>> Timestamp('2023-06-25 18:10:53+0100', tz='Europe/London')
london_aware.timestamp()
>> 1687713053.0
As you can see, two elements are added: tz (the continent and city of the timezone, as a string), and a '+' offset after the time, which indicates the difference between that timezone and UTC. The time itself does not change; it just becomes timezone-aware.
Now let's go back to the previous example. When we convert a timezone-aware Datetime object back to the Unix format, the timestamp changes because it is converted back to UTC time.
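To make this distinction concrete, here is a short sketch of tz_convert, which (unlike tz_localize) shifts the wall-clock time to another timezone while keeping the same instant:
# Converting an aware UTC timestamp to London time changes the wall clock,
# but not the underlying instant, so the Unix timestamp stays the same:
utc_aware = pd.to_datetime(1687716653, unit='s', utc=True)
utc_aware.tz_convert('Europe/London')
>> Timestamp('2023-06-25 19:10:53+0100', tz='Europe/London')
utc_aware.tz_convert('Europe/London').timestamp()
>> 1687716653.0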
Now let’s look at how to do it with the datetime library:
# Here we will need another library called pytz:
import pytz
datetime.fromtimestamp(1687716653, tz=pytz.timezone('Europe/London'))
>> datetime.datetime(2023, 6, 25, 19, 10, 53, tzinfo=<DstTzInfo 'Europe/London' BST+1:00:00 DST>)
As shown above, the time itself is actually already converted to the timezone we specified (and not automatically to our local timezone).
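A side note: from Python 3.9 onward, the standard library's zoneinfo module can serve the same purpose as pytz:
# The same conversion with the built-in zoneinfo module (Python 3.9+):
from zoneinfo import ZoneInfo
datetime.fromtimestamp(1687716653, tz=ZoneInfo('Europe/London'))
>> datetime.datetime(2023, 6, 25, 19, 10, 53, tzinfo=zoneinfo.ZoneInfo(key='Europe/London'))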
Type conversions
So far we've discussed time formats, but sometimes we may encounter time as other variable types, such as a string or a tuple. Let's see some examples and how we can convert them into the Datetime type using the datetime library:
# At times we may see a tuple where the items are the year, month, day, and so on:
tup = (2023, 6, 25, 21, 10, 53)
# We can convert it into a datetime object by unpacking it with an asterisk:
datetime(*tup)
>> datetime.datetime(2023, 6, 25, 21, 10, 53)
# If our time is in a string, we can convert it to a datetime object as well
# (note that we avoid naming the variable "str", which would shadow the built-in):
s = '25/06/2023 18:10:53'
datetime.strptime(s, '%d/%m/%Y %H:%M:%S')
>> datetime.datetime(2023, 6, 25, 18, 10, 53)
# Notice that strptime takes the format string as its second positional argument,
# and it must match the exact layout of the time parameters in the string.
# If we want, we can also change it back to a string:
dt = datetime.strptime(s, '%d/%m/%Y %H:%M:%S')
dt.strftime('%d/%m/%Y %H:%M:%S')
>> '25/06/2023 18:10:53'
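For completeness, pandas can parse the same string as well, and there a format keyword does exist:
# Parsing the same string with pandas:
pd.to_datetime('25/06/2023 18:10:53', format='%d/%m/%Y %H:%M:%S')
>> Timestamp('2023-06-25 18:10:53')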
Why should we bother converting these variable types into time variables? There is a very good reason for that.
Working with time
When analyzing TS data, working with time variables allows us to do some interesting things. I will show some of them here.
1. Calculating the time difference (a.k.a. time delta):
# With Unix Timestamps, the difference between two timestamps is simply their subtraction:
t_delta = 1687716653 - 1687615412
print(t_delta)
>> 101241
# It can then be converted to minutes, hours, days, or any other time unit:
minutes = t_delta / 60
hours = t_delta / 3600
days = t_delta / 86400
print(minutes, hours, days)
>> 1687.35 28.1225 1.17177
# With datetime objects we get a timedelta object:
a = pd.to_datetime(1687716653, unit='s')
b = pd.to_datetime(1687615412, unit='s')
a - b
>> Timedelta('1 days 04:07:21')
c = datetime.fromtimestamp(1687716653)
d = datetime.fromtimestamp(1687615412)
c - d
>> datetime.timedelta(days=1, seconds=14841)
# It can also be broken into its different components:
e = datetime.fromtimestamp(1687716653) - datetime.fromtimestamp(1687615412)
e.days
>> 1
e.seconds
>> 14841
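# A common pitfall: .seconds holds only the leftover seconds component (here,
# the 04:07:21 beyond the full day), not the total duration. For the total,
# use total_seconds():
e.total_seconds()
>> 101241.0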
2. Extracting different time parameters:
# We can extract only specific parameters that interest us.
# All we need to do is access the corresponding attribute of the time object (e.g. year, month, etc.)
dt = datetime.fromtimestamp(1687716653)
print(dt.year, dt.month, dt.day, dt.hour, dt.minute, dt.second)
>> 2023 6 25 18 10 53
# Or with pandas time object (exactly the same procedure):
pdt = pd.to_datetime(1687716653, unit='s')
print(pdt.year, pdt.month, pdt.day, pdt.hour, pdt.minute, pdt.second)
>> 2023 6 25 18 10 53
# We can also use the date to identify the day of the week:
print(dt.weekday(), pdt.weekday())
>> 6 6
# 0 = Monday, 1 = Tuesday, 2 = Wednesday, 3 = Thursday, 4 = Friday, 5 = Saturday, 6 = Sunday
# The conversion to the name of the day (as a string) can be done like this:
import calendar
print(calendar.day_name[6])
>> Sunday
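# Alternatively, strftime with the %A directive returns the day name directly:
dt.strftime('%A')
>> 'Sunday'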
3. Pandas has some nice methods to work with time:
# First, let's see how we can convert our time column (a pandas Series) between
# Unix timestamp and datetime formats
# Unix to Datetime:
df['TIME_COLUMN_NAME'] = pd.to_datetime(df['TIME_COLUMN_NAME'], unit='s')
# Datetime to Unix (this could be considered a "trick": we subtract the epoch
# (1970-01-01) and divide by one second, which is exactly what the Unix timestamp counts):
df['TIME_COLUMN_NAME'] = (df['TIME_COLUMN_NAME'] - pd.Timestamp("1970-01-01")) // pd.Timedelta('1s')
# Now let's create a dataframe and start working (numpy is needed for the random values)
import numpy as np
dates = pd.date_range('1/1/2023', periods=200, freq='D')
values = np.random.rand(200, 3)
df = pd.DataFrame(values, index=dates, columns=['One', 'Two', 'Three'])
df.head()
### freq and .asfreq ###
# If we want to check the frequency of our time series:
df.index.freq
>> <Day>
# But if the frequency is not consistent, we will get None. In that case, we
# can use asfreq to set the frequency ourselves and fill the added records
# with nulls:
df.asfreq('D')
# Notice that this method is performed on the entire dataframe, but it requires a datetime index.
# We can also fill the added records with other values if we want.
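# For example, filling the new records with zeros instead of nulls:
df.asfreq('D', fill_value=0)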
### Day of the week ###
# As demonstrated before we can get the numeric representation of the day of
# the week by using weekday, but we can also get the name of the day as a string:
df['numeric_day'] = df.index.weekday
df['day_name'] = df.index.day_name()
df.head()
### Difference and shifting ###
# Many times when analyzing TS data, we may want to know the difference between
# consecutive measurements. This can be done with diff():
df[['One', 'Two', 'Three']].diff(1)
# The number inside the parentheses determines the comparison target (how
# many steps to look back).
# Other times we may simply want to shift the values:
df[['One', 'Two', 'Three']].shift(1)
### Truncate ###
# We may want to choose a slice of the data frame based on dates or time. This
# can be easily done by truncate():
df.truncate(before=pd.Timestamp('2023-02-05'), after=pd.Timestamp('2023-02-10'))
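# Equivalently, a dataframe with a datetime index can be sliced with .loc:
df.loc['2023-02-05':'2023-02-10']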
### Resampling ###
# Now we have daily measurements. What if we want to convert them to weekly measurements?
# We might want to use our data to create the weekly average. There is a simple
# way to do it:
df[['One', 'Two', 'Three']].resample('W').mean()
# Obviously, there can be many other frequencies and statistical factors to calculate.
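# For example, resampling to monthly sums instead of weekly means:
df[['One', 'Two', 'Three']].resample('M').sum()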
### Rolling window ###
# This is a frequently used method that helps perform calculations over a rolling
# window of a fixed size. Let's try, for example, to calculate the moving average
# over a three-day window:
df[['One', 'Two', 'Three']].rolling(3).mean()
# Other statistical factors can be calculated as well.
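# For example, a rolling standard deviation over the same three-day window:
df[['One', 'Two', 'Three']].rolling(3).std()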
### Visualizations ###
# I'm not really going to go into the world of time series visualizations, as it
# has many options and deserves an article of its own. I just want to present the
# option of using .plot() as a simple, easy way to visualize the data with pandas:
import matplotlib.pyplot as plt
df[['One', 'Two', 'Three']].plot(figsize=(15,6))
plt.show()
# This can be further designed with the matplotlib library tools.
# (Apologies in advance for the boring graphs to come)
# It can also be used for creating bar charts:
df['day_name'].value_counts().plot.bar()
plt.show()
# Or to create histograms:
df['One'].hist(bins=50)
plt.show()
Note: Pandas has much more to offer, but this pretty much covers the basics. If you want, you are welcome to dig deeper and find more methods on the web.
Pros and Cons of the different formats
So what is the best format to work with? The answer is complicated and depends on your needs, as each format has its pros and cons:
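In brief, a rough summary of the trade-offs discussed throughout this article:
- Datetime: human-readable and rich in methods (extraction, formatting, arithmetic), but timezone handling is easy to get wrong.
- Time Tuple: simple, with easy access to the individual components, but carries no timezone information and no dedicated time methods.
- Unix Timestamp: compact, unambiguous (always relative to UTC), and convenient for storage and arithmetic, but not human-readable without conversion.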
Obviously, the formats can also be combined. For example, you might use the Unix Timestamp on the database and backend side, and Datetime for time-related operations and dashboards. But in these cases, we must be careful when converting between formats.
Summary
- When working with TS data, there are different time formats we can use to present our data.
- Different time formats have their pros and cons. Ideally, you should choose the one best suited for your needs and try to stick with it in order to avoid conversions.
- That being said, although conversions are risky, in many cases they are necessary. When we convert between formats, we should make sure that our time objects are timezone-aware, which will help us avoid unfortunate mistakes in the process.
- There are some designated methods we can use that can help us work with TS data more efficiently.
If you liked this article, please give it a clap. If you want, you can also follow me to see more of my content.
To read the next article in the series: Part 2: Time series analysis — Seasonality, Trends, and Frequencies