Logs! Can it be this simple? If you have spent any time developing and administrating software services, you have struggled with the pain of dealing with logs in production. Every app formats them differently. There are modern tools (Splunk, ELK stacks) to help aggregate data together, but this becomes highly challenging. I for one am tired of this problem.
The 12-factor app has suggested logs are time-series event streams with one event per line, and the Idiomatic Container Patterns reinforce this assertion. Along with this has come the movement that perhaps all logs should be in JSON format.
As somebody who has visited this road a few times, I can suggest it is not the final solution. It has two problems:
Sometimes as a human, you want to read your logs without a tool that is parsing them. When everything is JSON, this is rather painful.
Getting our software to be 100% pure JSON is extremely difficult. To bring our software to 98% is fairly easy, but the last 2% is painful. And most systems that parse log data have a problem when you mix JSON and non-JSON in the stream (as they should).
JSON is dense, adding a lot of extra data to your feeds — this increases log system sizes and data feeds.
I suggest instead of trying to push all the way to JSON, perhaps we take a step back and just address it at a simpler level.
Fundamentally every log can fit this universal format, with one message per line (no line wrapping):
TIMESTAMP message with spaces key=value key=value... ^^^^^^^^^ ^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^ ^^^^^^^^^ `- 1 `- 3 `- 2 ...
- Timestamp: Use ISO 8601:2004. We've standardized this. Move along.
- Starting from the end of the line are key=value assertions, in any order. These are free form, you can put anything into them. The key name is letter characters, no symbols or spaces (i.e. a-z or international equivalents). The value is either:
- Any character, not a space or quote, until a space is reached.
- Starts and ends with an ASCII quote (" / x22), and is encoded as a JSON string. Newlines are encoded as \n (remember, no line wrapping).
- After the key=value assertions have been removed, what remains is a human readable message (if any). This can be blank, with only key value pairs on the message. The logging service on the backend should store the resulting message as
msgunless that value is already defined, then store it as
Anything one would normally want to embed into a log message, such as the originating host, log level, etc, can be added as a key=value pair.
Furthermore, key=value pairs should be kept to a minimum, with the context of the time-series being used — and if possible request ID's — to help associate data together across lines. I.e. do multiple lines with more succinct message, rather than one long line with extra data (in some manner this is like avoiding transitive dependencies in a database).
If, when using this format, something else bleeds through that isn't quite compliant, it isn't as disastrous as it is when trying to do a JSON pure format.
The problem is, we make things more complex than we need to by having so many varieties of logging formats, and frameworks make it difficult to standardize output. Even the tools we use, like
logging in python, make it extremely difficult to wrangle the format yourself. So instead I ask, can we get everybody to work towards a more simplified universal format?
Consider some http logs formatted in this manner, which can now be interspersed with other log data (allowing one log file instead of multiples):
2017-07-31T11:35:55:00Z type=http status=204 query="GET /api/health HTTP/1.0" reqid=6b5a7132 len=- remote=10.63.50.245 2017-07-31T11:36:01:00Z type=http status=200 query="GET /api/service/object-1 HTTP/1.0" reqid=6b5a7133 len=453 remote=10.5.1.77 2017-07-31T11:36:02:00Z type=http status=200 query="GET /api/service/object-2 HTTP/1.0" reqid=6b5a7134 len=440 remote=10.5.1.77 2017-07-31T11:37:02:00Z JWT expired type=auth status=failed reqid=6b5a7134 2017-07-31T11:36:02:00Z type=auth status=success reqid=6b5a7136 apikey=4241 2017-07-31T11:36:02:00Z type=http status=200 query="GET /api/login/pond HTTP/1.0" reqid=6b5a7136 len=440 remote=10.5.1.77 2017-07-31T11:36:02:00Z type=http status=200 query="GET /api/service/object-2 HTTP/1.0" reqid=6b5a7134 len=440 remote=10.5.1.77 2017-07-31T11:37:02:00Z type=error traceback="Traceback (most recent call last):\n File \"<doctest...>\", line 10, in <module>\n lumberjack()\n File \"<doctest...>\", line 4, in lumberjack\n bright_side_of_death()\nIndexError: tuple index out of range\n" 2017-07-31T11:36:02:00Z type=http status=200 query="GET /api/instance/object-3 HTTP/1.0" reqid=6b5a7135 len=239 remote=10.5.1.77 2017-07-31T11:36:02:00Z type=http status=200 query="GET /api/instance/object-4 HTTP/1.0" reqid=6b5a7136 len=240 remote=10.5.1.77
And some random linux messages originally formatted:
Jul 31 07:45:01 testz0a systemd: Created slice user-0.slice. Jul 31 07:45:01 testz0a systemd: Starting user-0.slice. Jul 31 07:45:01 testz0a systemd: Started Session 455 of user root. Jul 31 07:45:01 testz0a systemd: Starting Session 455 of user root. Jul 31 07:45:01 testz0a systemd: Removed slice user-0.slice. Jul 31 07:45:01 testz0a systemd: Stopping user-0.slice. Jul 31 07:48:30 testz0a dhclient: DHCPREQUEST on eth0 to 10.1.0.1 port 67 (xid=0x768c6bb3) Jul 31 07:48:30 testz0a dhclient: DHCPACK from 10.1.0.1 (xid=0x768c6bb3) Jul 31 07:48:32 testz0a dhclient: bound to 10.1.0.242 -- renewal in 1629 seconds.
Could instead be:
2017-07-31T07:45:01:00Z Created svc=systemd slice=user-0.slice host=testz0a 2017-07-31T07:45:01:00Z Starting svc=systemd slice=user-0.slice host=testz0a 2017-07-31T07:45:01:00Z Started Session svc=systemd id=455 user=root host=testz0a 2017-07-31T07:45:01:00Z Starting Session svc=systemd id=455 user=root host=testz0a 2017-07-31T07:45:01:00Z Removed svc=systemd slice=user-0.slice host=testz0a 2017-07-31T07:45:01:00Z Stopping svc=systemd slice=user-0.slice host=testz0a 2017-07-31T07:48:30:00Z DHCPREQUEST svc=dhclient ip=10.1.0.1 xid=0x768c6bb3 nic=eth0 port=67 host=testz0a pid=791 2017-07-31T07:48:30:00Z DHCPACK svc=dhclient ip=10.1.0.1 xid=0x768c6bb3 host=testz0a pid=791 2017-07-31T07:48:32:00Z bound svc=dhclient ip=10.1.0.242 renewal="1629 seconds" host=testz0a pid=791
Thoughts? If you like this idea, please share!
Image courtesy of Ilya Pavlov