ELK Log Format: Structure and Parse Your Data Efficiently

Ever notice how searching raw log files feels like trying to find a specific grain of sand on a beach? That’s because unstructured text logs are basically useless for filtering, aggregating, or troubleshooting production issues at scale. ELK log formatting solves this by transforming messy text into structured JSON documents that Elasticsearch can actually index and search. This guide shows you how to set up Logstash pipelines, parse common formats with Grok patterns, and turn your logs into queryable data in minutes.

What ELK Log Formatting Means: A Practical Example

ELK log formatting is how you turn raw, messy log text into searchable JSON documents that Elasticsearch can actually work with. Without this step, your logs stay as plain text strings that you can’t filter, aggregate, or analyze worth a damn.

Here’s what an unformatted Apache access log looks like:

192.168.1.10 - - [15/Jan/2024:14:23:55 +0000] "GET /api/users HTTP/1.1" 200 1234 "https://example.com" "Mozilla/5.0"

A Logstash config transforms that raw text into structured data:

input {
  file {
    path => "/var/log/apache/access.log"
  }
}

filter {
  grok {
    match => { "message" => "%{IPORHOST:client_ip} %{USER:ident} %{USER:auth} \[%{HTTPDATE:timestamp}\] \"%{WORD:http_method} %{DATA:request_uri} HTTP/%{NUMBER:http_version}\" %{NUMBER:status_code:int} %{NUMBER:bytes:int} \"%{DATA:referrer}\" \"%{DATA:user_agent}\"" }
  }
  date {
    match => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ]
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "apache-logs-%{+YYYY.MM.dd}"
  }
}

The formatted result as JSON in Elasticsearch (abridged to the parsed fields; the date filter writes the parsed time to @timestamp by default):

{
  "client_ip": "192.168.1.10",
  "@timestamp": "2024-01-15T14:23:55.000Z",
  "http_method": "GET",
  "request_uri": "/api/users",
  "http_version": "1.1",
  "status_code": 200,
  "bytes": 1234,
  "referrer": "https://example.com",
  "user_agent": "Mozilla/5.0"
}

This structured format lets you do field-based searching in Kibana. You can filter by specific status codes, aggregate traffic by client IP, or visualize request volumes over time. With raw text logs, you’re stuck doing pattern matching against the entire string, so complex queries run slowly and aggregation is impractical.

The transformation from unstructured text to structured JSON? That’s what makes the ELK stack useful for production log management instead of just a glorified text file storage system.

Logstash Pipeline Components and Log Transformation

Logstash acts as the processing layer. It receives raw logs, transforms them into structured format, and sends the results to Elasticsearch for storage and indexing. The pipeline has three stages that handle ingestion, transformation, and output.

Input Plugins and Data Ingestion

Input plugins define where Logstash receives log data from. Logstash offers over 200 plugins across inputs, filters, and outputs, and the input plugins support sources like Filebeat, syslog, HTTP endpoints, and cloud services (Azure Event Hubs, AWS CloudWatch, Google Cloud Storage, Amazon S3).

The beats input receives logs from Filebeat agents running on your application servers:

input {
  beats {
    port => 5044
  }
}

The file input monitors log files directly on the Logstash server. It tracks positions automatically so you don’t reprocess the same lines:

input {
  file {
    path => "/var/log/application/*.log"
    start_position => "beginning"
  }
}

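The syslog input listens on a network port for syslog messages from network devices and servers. Port 514 is the traditional syslog port, but binding to ports below 1024 requires elevated privileges, so an unprivileged port is a common alternative: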

input {
  syslog {
    port => 514
  }
}

Filter Plugins and Data Transformation

The filter stage parses, transforms, and enriches log data. Plugins extract fields from unstructured text. The grok filter uses regular expression patterns to pull structured data from text logs. The json filter parses JSON-formatted message fields. The mutate filter modifies field values.

The date filter converts timestamp strings into proper date objects that Elasticsearch can index for time-based queries. The dissect filter provides faster parsing than grok for logs with consistent delimiters. The kv filter extracts key-value pairs from logs formatted as “key1=value1 key2=value2”.

The geoip filter enriches logs with geographic info based on IP addresses, adding fields like country code, city, and coordinates. The useragent filter parses User-Agent strings from web logs to extract browser, OS, and device information.

Filter plugins run in the order you define them. You can chain transformations where one filter’s output becomes the next filter’s input.
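
To make the chaining idea concrete, here is a minimal Python sketch. It is not Logstash code: it only mimics a kv step followed by a mutate-style type conversion, and the function names (kv_filter, convert_filter, run_pipeline) and the sample event are hypothetical.

```python
def kv_filter(event):
    """Mimic a kv step: split 'key=value' pairs out of the message field."""
    out = dict(event)
    for pair in event["message"].split():
        if "=" in pair:
            key, value = pair.split("=", 1)
            out[key] = value
    return out


def convert_filter(event, field, to_type):
    """Mimic a mutate/convert step: cast one field if it is present."""
    out = dict(event)
    if field in out:
        out[field] = to_type(out[field])
    return out


def run_pipeline(event, filters):
    """Apply filters in the order given; each sees the previous one's output."""
    for apply_filter in filters:
        event = apply_filter(event)
    return event


filters = [
    kv_filter,
    lambda e: convert_filter(e, "duration_ms", int),
]
event = run_pipeline({"message": "user=alice action=login duration_ms=42"}, filters)
print(event)
```

Swapping the two steps would leave duration_ms as a string, because the conversion would run before the field exists. The same ordering concern applies to real filter chains.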

Output Configuration to Elasticsearch

The output stage sends processed logs to Elasticsearch with proper indexing config. The elasticsearch output plugin batches events for bulk indexing to improve throughput:

output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "logs-%{+YYYY.MM.dd}"
  }
}

The index pattern uses date formatting to create daily indices. This enables efficient data lifecycle management where you can delete older indices or move them to slower storage. Bulk operations group multiple events into single HTTP requests, cutting network overhead and improving indexing speed.
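
As a rough illustration of how a daily index name falls out of an event's timestamp, the Python sketch below rebuilds the logs-YYYY.MM.dd name by hand. The daily_index helper is hypothetical, and the sketch assumes the date portion is taken from the event time expressed in UTC.

```python
from datetime import datetime, timedelta, timezone


def daily_index(prefix, event_time):
    """Build a daily index name such as logs-2024.01.15 from an event time."""
    utc_time = event_time.astimezone(timezone.utc)
    return f"{prefix}-{utc_time:%Y.%m.%d}"


morning_utc = datetime(2024, 1, 15, 8, 0, tzinfo=timezone.utc)
# 20:00 at a fixed UTC-5 offset is already 01:00 on the next day in UTC.
evening_est = datetime(2024, 1, 15, 20, 0, tzinfo=timezone(timedelta(hours=-5)))

print(daily_index("logs", morning_utc))
print(daily_index("logs", evening_est))
```

The second event lands in the next day's index even though its local date is still January 15, which is worth remembering when deleting indices by date.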

Essential Logstash pipeline config practices:

  • Deploy at least two Logstash nodes for uptime and stability during maintenance or failures
  • Allocate enough heap memory based on pipeline complexity and throughput needs (typically 2-8 GB for most workloads)
  • Configure pipeline.workers to match available CPU cores for parallel processing
  • Set appropriate batch sizes to balance memory usage against processing efficiency (typically 125-250 events per batch)
  • Use conditional logic in filters so different log formats go through the right parsing rules without sending everything through every filter

Parsing Common Log Formats with Grok Patterns

Grok patterns combine regular expressions with named fields to extract structured data from unstructured text logs. Logstash includes built-in patterns for common formats, and you can create custom patterns for application-specific stuff.

Apache and Nginx Web Server Logs

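Apache Combined Log Format is one of the most common web server formats. Logstash also ships a built-in COMBINEDAPACHELOG pattern; the version below builds on COMMONAPACHELOG to show how patterns compose: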

filter {
  grok {
    match => { "message" => "%{COMMONAPACHELOG} \"%{DATA:referrer}\" \"%{DATA:user_agent}\"" }
  }
}

This pattern extracts client IP, timestamp, HTTP method, URI, protocol version, status code, response bytes, referrer, and user agent (the exact field names depend on whether ECS compatibility mode is enabled for the grok filter) from a log line like:

10.0.1.45 - admin [15/Jan/2024:08:15:42 +0000] "POST /api/login HTTP/1.1" 200 892 "https://app.example.com/login" "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"

Nginx access logs use a similar but slightly different format:

filter {
  grok {
    match => { "message" => "%{IPORHOST:remote_addr} - %{DATA:remote_user} \[%{HTTPDATE:time_local}\] \"%{WORD:request_method} %{DATA:request_uri} HTTP/%{NUMBER:http_version}\" %{NUMBER:status} %{NUMBER:body_bytes_sent} \"%{DATA:http_referer}\" \"%{DATA:http_user_agent}\"" }
  }
}

Both patterns create searchable fields. You can filter logs by status code ranges (4xx client errors, 5xx server errors), analyze traffic by URI path, or identify problematic user agents.

Syslog and System Logs

RFC3164 syslog format follows this pattern:

filter {
  grok {
    match => { "message" => "%{SYSLOGTIMESTAMP:timestamp} %{SYSLOGHOST:hostname} %{DATA:program}(?:\[%{POSINT:pid}\])?: %{GREEDYDATA:syslog_message}" }
  }
}

This extracts timestamp, hostname, process name, process ID, and message content from logs like:

Jan 15 14:30:25 webserver01 sshd[12345]: Accepted publickey for admin from 10.0.1.100 port 52341 ssh2

RFC5424 syslog adds structured data elements:

filter {
  grok {
    match => { "message" => "%{SYSLOG5424PRI}%{NONNEGINT:version} +(?:%{TIMESTAMP_ISO8601:timestamp}|-) +(?:%{HOSTNAME:hostname}|-) +(?:%{NOTSPACE:program}|-) +(?:%{NOTSPACE:pid}|-) +(?:%{NOTSPACE:msgid}|-) +(?:%{SYSLOG5424SD:sd}|-|) +%{GREEDYDATA:syslog_message}" }
  }
}

This format includes facility and severity codes that map to specific system components and urgency levels.
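
The priority number at the start of a syslog line packs both codes into one value: facility * 8 + severity. The Python sketch below unpacks it. The decode_pri helper is hypothetical; the severity names follow RFC 5424.

```python
SEVERITIES = [
    "emergency", "alert", "critical", "error",
    "warning", "notice", "informational", "debug",
]


def decode_pri(pri):
    """Split a syslog PRI value into (facility, severity) numbers."""
    if not 0 <= pri <= 191:
        raise ValueError(f"PRI out of range: {pri}")
    return divmod(pri, 8)


facility, severity = decode_pri(34)
print(facility, severity, SEVERITIES[severity])
```

A line starting with <34> therefore decodes to facility 4 and severity 2 (critical).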

Application Error and Debug Logs

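Application logs typically put a timestamp, level, thread, and class name ahead of the message. The pattern below parses that first line; multi-line stack traces additionally need to be grouped with the initial error before parsing, as covered in the multi-line event handling section: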

filter {
  grok {
    match => { "message" => "(?<timestamp>%{YEAR}-%{MONTHNUM}-%{MONTHDAY} %{TIME}) %{LOGLEVEL:log_level} \[%{DATA:thread}\] %{JAVACLASS:class} - %{GREEDYDATA:log_message}" }
  }
}

For Java applications, this pattern extracts structured fields from logs like:

2024-01-15 14:30:45.123 ERROR [http-nio-8080-exec-1] com.example.UserService - Failed to create user account
java.sql.SQLException: Connection timeout
    at com.example.db.ConnectionPool.getConnection(ConnectionPool.java:45)
    at com.example.UserService.createUser(UserService.java:78)

| Log Type | Sample Log Line | Grok Pattern | Extracted Fields |
|---|---|---|---|
| Apache Access | 192.168.1.5 - - [15/Jan/2024:10:30:15 +0000] "GET /index.html HTTP/1.1" 200 5432 | %{COMMONAPACHELOG} | clientip, timestamp, verb, request, httpversion, response, bytes |
| SSH Authentication | Jan 15 10:30:15 server01 sshd[1234]: Failed password for invalid user test from 10.0.1.50 | %{SYSLOGTIMESTAMP:timestamp} %{HOSTNAME:host} %{WORD:program}\[%{NUMBER:pid}\]: %{GREEDYDATA:message} | timestamp, host, program, pid, message |
| JSON Application | {"timestamp":"2024-01-15T10:30:15.123Z","level":"ERROR","message":"Database connection failed"} | Built-in JSON filter, no Grok needed | timestamp, level, message (all JSON fields) |
| Custom App Log | [2024-01-15 10:30:15] [WARN] UserController: Login attempt from blocked IP 10.0.1.75 | \[%{TIMESTAMP_ISO8601:timestamp}\] \[%{LOGLEVEL:level}\] %{WORD:class}: %{GREEDYDATA:message} | timestamp, level, class, message |

The pattern syntax uses %{PATTERN:fieldname} format. PATTERN is a built-in or custom Grok pattern and fieldname is what you want to call the extracted value in the resulting JSON document. You can chain multiple patterns for complex logs by combining them in sequence. Use conditional logic to try different patterns when log format varies within the same source.
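
One way to picture the %{PATTERN:fieldname} syntax is as shorthand for named regex groups. The Python sketch below expands a handful of tokens that way. It is a simplified stand-in rather than how Grok is implemented: the PATTERNS table is a tiny hypothetical library with deliberately loose regexes, and grok_to_regex is an illustrative helper.

```python
import re

# A tiny, hypothetical pattern library; real Grok ships far more patterns.
PATTERNS = {
    "IP": r"\d{1,3}(?:\.\d{1,3}){3}",
    "WORD": r"\w+",
    "NUMBER": r"\d+(?:\.\d+)?",
    "NOTSPACE": r"\S+",
    "GREEDYDATA": r".*",
}

GROK_TOKEN = re.compile(r"%\{(\w+)(?::(\w+))?\}")


def grok_to_regex(expression):
    """Expand %{PATTERN:field} tokens into named regex groups."""
    def expand(match):
        name, field = match.group(1), match.group(2)
        body = PATTERNS[name]
        return f"(?P<{field}>{body})" if field else f"(?:{body})"
    return re.compile(GROK_TOKEN.sub(expand, expression))


regex = grok_to_regex(r"%{IP:client_ip} %{WORD:method} %{NOTSPACE:uri} %{NUMBER:status}")
fields = regex.match("10.0.1.45 GET /api/users 200").groupdict()
print(fields)
```

Every captured value comes back as a string here, which mirrors why numeric fields need an explicit conversion before they can be treated as numbers.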

JSON Logging and Elastic Common Schema Standardization

JSON format is the preferred structured logging approach for modern applications. It bypasses heavy parsing requirements in Logstash and arrives at Elasticsearch ready for indexing.

A properly formatted JSON log entry includes essential fields that enable immediate filtering and correlation:

{
  "@timestamp": "2024-01-15T14:30:45.123Z",
  "log.level": "ERROR",
  "message": "Failed to process payment transaction",
  "service.name": "payment-api",
  "service.version": "2.3.1",
  "trace.id": "a3f7b2c1-4d8e-9f6a-2b1c-8e3d7a9f2c4b",
  "user.id": "user_12345",
  "http.request.method": "POST",
  "http.response.status_code": 500,
  "error.message": "Connection timeout to payment gateway"
}

The Elastic Common Schema (ECS) is the standard field naming convention that enables normalized querying across different log sources. Instead of one application using “ipaddress” while another uses “clientip” and a third uses “remote_addr”, ECS defines standard field names that all sources should use.

ECS standardization provides three main benefits. Cross-source queries become possible: a single search like source.ip: "10.0.1.50" works across web server logs, application logs, and firewall logs without knowing their individual field naming schemes. Dashboards become reusable across different data sources because field names match, so you don’t need separate visualizations for each application. Field mappings remain consistent, so Elasticsearch knows that http.response.status_code is always a number and user.name is always a keyword.

Common ECS fields include host.name for the server hostname, user.name for authenticated username, event.action describing what happened (login, fileaccess, apicall), http.request.method for HTTP verbs, source.ip and destination.ip for network communication, and error.stack_trace for detailed exception information.

Applications use logging libraries to automatically generate ECS-compliant JSON logs without manual field mapping. The zap library with the ecszap encoder for Go creates conformant log entries with timestamp, source file, code line, and log level information. Node.js applications using winston can add the @elastic/ecs-winston-format package to structure logs correctly. Python applications use ecs-logging for automatic ECS compliance, and Java applications can use the ecs-logging-java library.
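
To show roughly what those libraries produce, here is a standard-library-only Python sketch of a formatter that emits one JSON object per line with ECS-style keys. It is not the ecs-logging package and covers only a few fields; the EcsStyleFormatter class and the "payments" logger name are hypothetical.

```python
import json
import logging
from datetime import datetime, timezone


class EcsStyleFormatter(logging.Formatter):
    """Render log records as one JSON object per line with ECS-style keys."""

    def __init__(self, service_name):
        super().__init__()
        self.service_name = service_name

    def format(self, record):
        timestamp = datetime.fromtimestamp(record.created, tz=timezone.utc)
        document = {
            "@timestamp": timestamp.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "log.level": record.levelname,
            "message": record.getMessage(),
            "service.name": self.service_name,
            "log.logger": record.name,
        }
        if record.exc_info:
            # Keep the full traceback inside the same JSON document.
            document["error.stack_trace"] = self.formatException(record.exc_info)
        return json.dumps(document)


handler = logging.StreamHandler()
handler.setFormatter(EcsStyleFormatter("payment-api"))
logger = logging.getLogger("payments")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.error("Failed to process payment transaction")
```

Because the stack trace travels inside the JSON document, a log line written this way needs no multiline handling downstream.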

Kibana Integration and End-to-End Log Flow

Kibana provides the visualization interface that consumes formatted log data from Elasticsearch, completing the end-to-end pipeline from application to analysis.

Index patterns (called data views in Kibana 8.x) connect Kibana to Elasticsearch indices with proper field mapping for searching and filtering. The pattern logs-* matches all indices starting with “logs-”. You can query across daily indices like logs-2024.01.15, logs-2024.01.16, and logs-2024.01.17 with a single search. Field mapping tells Kibana which fields are text (analyzed for full-text search), keywords (exact matching), numbers (range queries), or dates (time-based filtering).

The Discover interface provides ad-hoc log searching where well-formatted fields enable filtering by log level (log.level: "ERROR"), timestamp ranges (using the time picker), specific field values (service.name: "payment-api"), and free-text matching (message: timeout). You can combine multiple filters with AND/OR logic, save searches for reuse, and inspect individual log documents to see all extracted fields and their values.

Visualization types best suited for different log analysis scenarios:

  • Time-series histograms show log volume over time to identify traffic spikes or system outages
  • Pie charts display error distribution across services or status code breakdowns to find which endpoints fail most
  • Data tables present detailed log inspection with sortable columns for drilling into specific events
  • Line graphs track trends in response times, error rates, or resource usage over hours or days
  • Heat maps reveal correlation between variables like error rates by hour-of-day and day-of-week to spot patterns

Filebeat and Log Shipping Configuration

Filebeat is the lightweight log shipper in the Beats framework. It’s designed for forwarding log files from servers and containers to Logstash or Elasticsearch. It tracks file positions in a registry so that it can resume where it left off after a restart or an outage at the destination. Delivery is at-least-once: every line gets shipped, though a line can occasionally be sent twice.

Basic Filebeat config includes input paths specifying which log files to monitor, output destination pointing to either Logstash for processing or directly to Elasticsearch for pre-formatted logs, and multiline handling for log entries that span multiple lines:

filebeat.inputs:
- type: log
  enabled: true
  paths:
    - /var/log/application/*.log
  multiline.pattern: '^[0-9]{4}-[0-9]{2}-[0-9]{2}'
  multiline.negate: true
  multiline.match: after

output.logstash:
  hosts: ["logstash:5044"]

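This configuration monitors all .log files in /var/log/application/, uses multiline pattern matching to group stack traces with their initiating log line, and sends collected logs to Logstash on port 5044 for processing before Elasticsearch storage. In Filebeat 8.x the log input type is deprecated in favor of filestream, which takes the same multiline options under its parsers setting.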

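Docker integration enables Filebeat to automatically discover containers without manually configuring each container’s log path. With hints enabled, the autodiscover feature reads co.elastic.logs/* labels on each container to adjust how its logs are collected. As written, the config below collects logs from every container; to collect only from containers labeled co.elastic.logs/enabled: true, also set hints.default_config.enabled: false: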

filebeat.autodiscover:
  providers:
    - type: docker
      hints.enabled: true
      hints.default_config:
        type: container
        paths:
          - /var/lib/docker/containers/${data.container.id}/*.log

processors:
  - add_docker_metadata: ~

output.logstash:
  hosts: ["logstash:5044"]

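Docker Compose orchestration ties everything together with a config mounting the Filebeat config file and Docker socket. The excerpt below shows only the Filebeat service and assumes a logstash service is defined elsewhere in the same file: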

version: '3'
services:
  filebeat:
    image: docker.elastic.co/beats/filebeat:8.11.0
    volumes:
      - ./filebeat.yaml:/usr/share/filebeat/filebeat.yml:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro
    depends_on:
      - logstash

Timestamp Parsing and Date Formatting

Accurate timestamp parsing is crucial for time-series log analysis, chronological troubleshooting, and correlation of events across distributed systems where components running in different data centers must align their event timelines.

Common timestamp formats require specific parsing patterns to convert text representations into proper date objects. ISO8601 format 2024-01-15T14:30:45.123Z is the standard for modern applications and parses with the date filter’s built-in ISO8601 keyword. Unix epoch 1705329045 represents seconds since January 1, 1970 UTC and parses with the UNIX keyword (UNIX_MS for milliseconds). Apache log format 15/Jan/2024:14:30:45 +0000 uses abbreviated month names requiring the pattern dd/MMM/yyyy:HH:mm:ss Z. Syslog format Jan 15 14:30:45 lacks year and timezone info, so the parser assumes the current year and the system timezone; it uses the pattern MMM dd HH:mm:ss, plus MMM  d HH:mm:ss for space-padded single-digit days. Custom application formats like 2024-01-15 14:30:45.123 need explicit patterns such as yyyy-MM-dd HH:mm:ss.SSS.

The Logstash date filter config matches timestamp strings and converts them to the @timestamp field that Elasticsearch indexes for time-based queries:

filter {
  grok {
    match => { "message" => "%{TIMESTAMP_ISO8601:log_timestamp} %{LOGLEVEL:level} %{GREEDYDATA:log_message}" }
  }
  date {
    match => [ "log_timestamp", "ISO8601", "yyyy-MM-dd HH:mm:ss.SSS", "dd/MMM/yyyy:HH:mm:ss Z" ]
    target => "@timestamp"
    timezone => "UTC"
  }
}

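This config tries multiple patterns in order until one successfully parses the timestamp, then writes the result to @timestamp, which is always stored in UTC. The timezone option only tells the filter how to interpret timestamps that carry no offset of their own; a value with an explicit offset such as +0000 or -05:00 is converted using that offset.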

Timezone handling requires standardizing to UTC for distributed systems to avoid timezone-related correlation errors. A log entry from a server in New York at 14:00 EST and another from London at 19:00 GMT represent the same moment, but if their timestamps are parsed without their zone offsets they appear five hours apart in Kibana timeline views. Converting all timestamps to UTC during ingestion ensures chronological ordering works correctly when investigating incidents that span multiple geographic regions.
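
The try-each-pattern-in-order behavior and the UTC normalization described above can be sketched in a few lines of Python. The FORMATS list holds strptime approximations of the date filter patterns, and parse_timestamp is a hypothetical helper; timestamps without an offset are assumed to be UTC here.

```python
from datetime import datetime, timezone

# Python strptime approximations of the date filter patterns discussed above.
FORMATS = [
    "%Y-%m-%dT%H:%M:%S.%f%z",  # ISO8601 with fractional seconds and offset
    "%Y-%m-%d %H:%M:%S.%f",    # yyyy-MM-dd HH:mm:ss.SSS (no zone)
    "%d/%b/%Y:%H:%M:%S %z",    # dd/MMM/yyyy:HH:mm:ss Z (Apache)
]


def parse_timestamp(text, default_tz=timezone.utc):
    """Try each format in order and return an aware datetime in UTC."""
    for fmt in FORMATS:
        try:
            parsed = datetime.strptime(text, fmt)
        except ValueError:
            continue
        if parsed.tzinfo is None:
            # No offset in the text: fall back to the configured default zone.
            parsed = parsed.replace(tzinfo=default_tz)
        return parsed.astimezone(timezone.utc)
    raise ValueError(f"no format matched: {text!r}")


new_york = parse_timestamp("2024-01-15T14:00:00.000-05:00")
london = parse_timestamp("2024-01-15T19:00:00.000+00:00")
print(new_york == london)
```

Once both values carry their offsets through parsing, the New York and London entries compare as the same instant.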

Common timestamp formats with concrete examples:

  • ISO8601 with milliseconds: 2024-01-15T14:30:45.123Z
  • ISO8601 with timezone offset: 2024-01-15T09:30:45-05:00
  • Unix epoch seconds: 1705329045
  • Unix epoch milliseconds: 1705329045123
  • Apache Combined Log Format: 15/Jan/2024:14:30:45 +0000
  • Standard syslog: Jan 15 14:30:45

Multi-Line Event Handling and Log Aggregation

Single log events that span multiple lines create the multi-line problem. Java stack traces, Python tracebacks, and formatted JSON objects get split into separate events if each line is treated independently. This breaks analysis and makes error investigation nearly impossible.

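Multiline settings in Filebeat and the multiline codec in Logstash use pattern matching to identify event boundaries with three key settings. pattern defines the regular expression that identifies either the start of a new event or continuation lines. negate controls which lines count as continuations: when false, lines that match the pattern are continuations; when true, lines that do not match are continuations, so the pattern effectively marks event starts. The third setting says which event a continuation joins: Filebeat calls it match, where “after” appends continuation lines to the preceding line and “before” prepends them to the following one, while the Logstash codec calls it what, with the values “previous” and “next”.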

Configuration examples show pattern matching for Java exceptions and line continuation rules:

filebeat.inputs:
- type: log
  paths:
    - /var/log/application/app.log
  multiline.type: pattern
  multiline.pattern: '^[0-9]{4}-[0-9]{2}-[0-9]{2}'
  multiline.negate: true
  multiline.match: after

This groups all lines not starting with a date pattern (stack trace lines beginning with whitespace or “at”) with the previous log entry that did start with a timestamp.
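
The grouping rule is easy to emulate. The Python sketch below applies the same logic (negate: true, match: after) to a list of lines; group_multiline is a hypothetical helper, not part of Filebeat.

```python
import re

EVENT_START = re.compile(r"^[0-9]{4}-[0-9]{2}-[0-9]{2}")


def group_multiline(lines):
    """Append lines that do not start with a date to the preceding event."""
    events = []
    for line in lines:
        if EVENT_START.match(line) or not events:
            events.append(line)
        else:
            events[-1] += "\n" + line
    return events


raw_lines = [
    "2024-01-15 14:30:45.123 ERROR [http-nio-8080-exec-1] com.example.UserService - Failed to create user account",
    "java.sql.SQLException: Connection timeout",
    "    at com.example.db.ConnectionPool.getConnection(ConnectionPool.java:45)",
    "2024-01-15 14:30:46.001 INFO [main] com.example.App - Retrying",
]
events = group_multiline(raw_lines)
print(len(events))
```

The four input lines collapse into two events: the error with its stack trace attached, and the following INFO line.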

For Python tracebacks that start with “Traceback (most recent call last):” followed by indented lines:

filebeat.inputs:
- type: log
  paths:
    - /var/log/python/app.log
  multiline.type: pattern
  multiline.pattern: '^Traceback|^  File|^    '
  multiline.negate: false
  multiline.match: after

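Lines matching the pattern are appended to the previous event. Note that the final line of a traceback (the exception type and message) starts at column zero and does not match this pattern, so it begins a new event. When every log entry starts with a timestamp, the date-based pattern with negate: true shown earlier keeps the whole traceback, including that last line, in one searchable event.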

Field Mapping and Index Templates in Elasticsearch

Field mapping is Elasticsearch’s schema definition for log documents. It determines how fields are indexed and searched, controlling whether text gets analyzed for full-text search or stored as exact values for filtering.

Text fields get analyzed with tokenization that breaks content into individual words for full-text searching. This enables queries like “find logs containing ‘connection’ AND ‘timeout’” to match the message “Database connection timeout occurred”. Keyword fields use exact matching for filtering, aggregations, and sorting without tokenization. This makes them suitable for status codes, hostnames, user IDs, and other values you want to count or group by rather than search within.
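
The difference is easiest to see side by side. The Python sketch below uses a crude stand-in for text analysis (lowercase, split on anything that is not an ASCII letter or digit), which only approximates what a real analyzer does; analyze, text_match_all, and keyword_match are hypothetical helpers.

```python
import re


def analyze(text):
    """Rough stand-in for a text analyzer: lowercase, split on non-alphanumerics."""
    return [token for token in re.split(r"[^0-9a-z]+", text.lower()) if token]


def text_match_all(field_value, *terms):
    """Text-style match: every term must appear among the analyzed tokens."""
    tokens = set(analyze(field_value))
    return all(term.lower() in tokens for term in terms)


def keyword_match(field_value, term):
    """Keyword-style match: the whole value must be identical."""
    return field_value == term


message = "Database connection timeout occurred"
print(analyze(message))
print(text_match_all(message, "connection", "timeout"))
print(keyword_match("payment-api", "Payment-API"))
```

The text-style search finds both words regardless of case or position, while the keyword comparison rejects anything that is not an exact match.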

Index templates automatically apply mappings to new indices matching a pattern like logs-* or filebeat-*, ensuring consistent field types across time-based indices. When Elasticsearch creates logs-2024.01.15 followed by logs-2024.01.16 the next day, the template ensures both indices map http.response.status_code as a number and user.name as a keyword. A template body (sent with a PUT _index_template/<name> request) looks like this:

{
  "index_patterns": ["logs-*"],
  "template": {
    "mappings": {
      "properties": {
        "@timestamp": { "type": "date" },
        "log.level": { "type": "keyword" },
        "message": { "type": "text" },
        "service.name": { "type": "keyword" },
        "http.response.status_code": { "type": "long" },
        "source.ip": { "type": "ip" }
      }
    }
  }
}

Mapping conflicts occur when field types don’t match expectations, like trying to index a text value into a field previously mapped as a number. The affected documents are rejected, which means lost data unless the pipeline captures the failures. New fields can be added to an existing mapping, but changing the type of an existing field requires creating a new index and migrating the data through a reindex operation. That can take hours or days for terabyte-scale log volumes, which is why field mappings are worth planning before ingesting production data.
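
One inexpensive safeguard is to check sample documents against the planned field types before they reach the cluster. The Python sketch below does that with a hypothetical EXPECTED_TYPES table and find_type_conflicts helper. It is intentionally stricter than Elasticsearch, which by default coerces numeric strings such as "500" into numbers.

```python
EXPECTED_TYPES = {
    "message": str,
    "service.name": str,
    "http.response.status_code": int,
}


def find_type_conflicts(document, expected=EXPECTED_TYPES):
    """Return (field, value) pairs whose Python type differs from the plan."""
    conflicts = []
    for field, expected_type in expected.items():
        if field not in document:
            continue
        value = document[field]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected_type):
            conflicts.append((field, value))
    return conflicts


good = {"message": "ok", "http.response.status_code": 200}
bad = {"message": "upstream error", "http.response.status_code": "N/A"}
print(find_type_conflicts(good))
print(find_type_conflicts(bad))
```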

| Field Type | Use Case | Search Behavior | Example |
|---|---|---|---|
| text | Full-text search in log messages, error descriptions, user comments | Analyzed: tokenized and lowercased by the default standard analyzer (stemming only if a custom analyzer adds it) | “Failed to connect to database” becomes searchable tokens [failed, to, connect, to, database] |
| keyword | Exact matching for filtering, sorting, aggregating hostnames, status codes, tags | Not analyzed, case-sensitive, exact match only | service.name: "payment-api" matches exactly, "Payment-API" does not |
| date | Timestamps, event times, expiration dates | Range queries, time-based filtering, date histograms | @timestamp: "2024-01-15T14:30:45Z" enables queries like “last 24 hours” |
| long | Numeric values like response codes, byte counts, durations in milliseconds | Range queries, mathematical aggregations (sum, avg, percentiles) | http.response.status_code: 500 for exact match, or http.response.status_code >= 500 for range |
| ip | IPv4 and IPv6 addresses for source and destination tracking | CIDR range matching | source.ip: "10.0.0.0/8" matches every address in that private range |
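
To illustrate what CIDR matching on an ip field means, the Python sketch below runs the same membership test with the standard ipaddress module. The ip_in_cidr helper is hypothetical and only mirrors the idea; it says nothing about how Elasticsearch implements it.

```python
import ipaddress


def ip_in_cidr(address, cidr):
    """Return True when the address falls inside the CIDR block."""
    return ipaddress.ip_address(address) in ipaddress.ip_network(cidr)


print(ip_in_cidr("10.0.1.50", "10.0.0.0/8"))
print(ip_in_cidr("192.168.1.10", "10.0.0.0/8"))
```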

Testing and Debugging Log Parsing Configurations

Testing parsing rules with sample log data before processing production logs prevents data loss and mapping conflicts that require expensive reindexing operations to fix.

The Grok Debugger tool in Kibana (accessed through Dev Tools > Grok Debugger) validates patterns against sample log entries, showing which fields get extracted and their values. Paste a sample log line, enter your Grok pattern, click Simulate, and immediately see whether the pattern matches and what structured data gets extracted. This catches mistakes like incorrect field names, missing escape characters, or patterns that don’t account for optional fields before any logs reach Elasticsearch.

The rubydebug codec outputs parsed events to stdout during development, allowing inspection of field names and values:

output {
  stdout {
    codec => rubydebug
  }
}

Run Logstash with this output config and watch the console to verify fields are extracted correctly, data types match expectations, and transformations produce the intended results.

Common parsing errors include field type mismatches, where a field mapped as a number receives text values and indexing fails; missing fields that break dashboard queries expecting them to exist; and failed pattern matches, which leave the event unparsed with only the raw message text and add the _grokparsefailure tag.

Systematic troubleshooting when logs aren’t parsing correctly:

  1. Check the raw log format by examining actual log files to ensure the format matches your pattern expectations. Look for variations in timestamp formats, optional fields, or encoding issues.
  2. Validate the Grok pattern using Kibana’s Grok Debugger with representative samples including edge cases like missing optional fields or unusual values.
  3. Verify field mapping in the Elasticsearch index template to ensure data types match the values being indexed and required fields are defined.
  4. Inspect Logstash logs for parsing errors, warnings about type mismatches, or patterns that consistently fail to match.
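
The first two steps can also be rehearsed offline with a small script that runs a pattern over representative samples and reports which ones fail. The Python sketch below uses a plain regex in place of Grok; ACCESS_LINE, parse_or_tag, and the _parsefailure tag are hypothetical names chosen to mirror the behavior described above.

```python
import re

ACCESS_LINE = re.compile(
    r'(?P<client_ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<http_method>\w+) (?P<request_uri>\S+) HTTP/(?P<http_version>[\d.]+)" '
    r'(?P<status_code>\d{3}) (?P<bytes>\d+|-)'
)


def parse_or_tag(line):
    """Return extracted fields, or the raw line tagged as a parse failure."""
    match = ACCESS_LINE.match(line)
    if match is None:
        return {"message": line, "tags": ["_parsefailure"]}
    fields = match.groupdict()
    fields["status_code"] = int(fields["status_code"])
    return fields


samples = [
    '192.168.1.10 - - [15/Jan/2024:14:23:55 +0000] "GET /api/users HTTP/1.1" 200 1234',
    '192.168.1.10 - - [15/Jan/2024:14:23:56 +0000] "GET /health HTTP/1.1" 304 -',
    "this line does not follow the access log format",
]
results = [parse_or_tag(line) for line in samples]
failures = [result for result in results if "tags" in result]
print(f"{len(failures)} of {len(samples)} samples failed to parse")
```

Including edge cases in the samples, such as the "-" byte count in the second line, is what catches patterns that only work on the happy path.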

Performance Optimization for Log Processing Pipelines

Parsing complexity directly affects throughput. Complex Grok patterns using multiple nested regular expressions run slower than simpler methods like dissect or JSON parsing. As a purely illustrative figure, a Grok pattern with ten field extractions and conditional logic might process 1,000 events per second per worker thread, while the dissect filter parsing the same fixed-format log handles 10,000; actual rates depend on hardware and patterns, so benchmark with your own data. For high-volume scenarios processing millions of logs per day, choose JSON structured logging when possible, use dissect for consistently formatted text logs, and reserve Grok patterns for irregular formats that require flexible matching.

Logstash performance tuning parameters control how the pipeline processes events. The pipeline.workers setting (default matches CPU core count) controls parallelism by determining how many filter and output threads run simultaneously. The pipeline.batch.size setting (default 125) groups events into batches before processing, reducing per-event overhead at the cost of memory usage. The pipeline.batch.delay setting (default 50 milliseconds) controls how long to wait for a batch to fill before processing a partial batch, balancing latency against throughput.

Increasing workers from 4 to 8 on an 8-core system can double throughput if filtering is CPU-bound. Raising batch size from 125 to 500 reduces context switching overhead but increases memory consumption and the amount of data lost if Logstash crashes mid-batch. These settings interact with heap memory allocation, where larger batches and more workers require more memory to hold events in flight.
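
A quick way to reason about that interaction: the number of events held in flight is bounded by workers multiplied by batch size. The Python sketch below does the arithmetic. Both helpers are hypothetical, and the memory figure counts raw event size only, so real heap usage will be higher.

```python
def inflight_events(workers, batch_size):
    """Upper bound on events held in memory by the worker threads at once."""
    return workers * batch_size


def estimated_inflight_mb(workers, batch_size, avg_event_kb):
    """Rough raw-payload size of the in-flight events, in megabytes."""
    return inflight_events(workers, batch_size) * avg_event_kb / 1024


print(inflight_events(8, 125))
print(inflight_events(8, 500))
print(estimated_inflight_mb(8, 500, 2))
```

Moving from a batch size of 125 to 500 on eight workers raises the in-flight bound from 1,000 to 4,000 events, which is the number to keep in mind when sizing the heap.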

Elasticsearch bulk indexing settings determine how quickly processed logs reach storage. The bulk size controls how many documents get sent in a single indexing request; in the Logstash elasticsearch output it follows pipeline.batch.size rather than a separate setting. Larger bulks (1,000 to 5,000 documents) reduce HTTP request overhead but increase memory usage and retry complexity if the request fails. Proper batching can improve ingestion rates from a few thousand to tens of thousands of documents per second while avoiding memory pressure that triggers garbage collection pauses.

| Configuration Parameter | Default Value | Tuning Guidance | Impact |
|---|---|---|---|
| pipeline.workers | CPU core count | Raise above the core count when workers spend time waiting on I/O, decrease if memory-limited | Higher values increase parallelism and throughput but consume more memory |
| pipeline.batch.size | 125 | Increase to 250-1000 for high throughput, decrease if memory pressure occurs | Larger batches reduce overhead and improve throughput but increase latency and memory usage |
| heap memory | 1 GB | Allocate 2-8 GB based on batch size and worker count, monitor GC frequency | Insufficient heap causes out-of-memory errors, excessive heap wastes resources |
| bulk size | Varies by output (the Logstash elasticsearch output sends one bulk request per pipeline batch) | Aim for 1000-5000 documents per request to balance request size and retry complexity | Larger bulks reduce indexing overhead but make a failed request more expensive to retry |
| refresh interval | 1 second | Increase to 30s or disable during bulk loading for faster indexing | Longer intervals reduce indexing overhead but delay data availability for search |

Log Format Best Practices and Implementation Guidelines

Planning log schema before implementation avoids costly reindexing operations. Changing the type of an existing field in Elasticsearch requires creating new indices and migrating the existing data, an operation that can run for days on production systems with billions of log entries.

Essential fields every log entry should include:

  • @timestamp with timezone information, to ensure accurate time-based querying across distributed systems
  • Severity or log level (ERROR, WARN, INFO, DEBUG), to filter noise and prioritize critical issues
  • message, containing a human-readable description of what happened
  • Service or application name, to identify which component generated the log when hundreds of services share infrastructure
  • Host identifier, showing which server or container produced the log for infrastructure correlation
  • Correlation IDs like trace.id or request.id, for distributed tracing across microservices that handle a single user request

Final Words

ELK log format transforms raw text into searchable JSON documents that make debugging and monitoring actually work.

The pipeline moves from application to Filebeat to Logstash (where Grok patterns or JSON parsing happens) to Elasticsearch storage, then to Kibana for visualization.

Get your schema right early. Test your Grok patterns before production. Use structured JSON logging when you can.

Start with common formats like Apache or syslog, validate with the Grok Debugger, and tune performance as your volume grows. Two Logstash nodes and proper batching handle most production loads without drama.

FAQ

What is the best format for ELK logs?

The best format for ELK logs is JSON with Elastic Common Schema (ECS) field naming conventions. JSON bypasses heavy parsing requirements, maps directly to Elasticsearch documents, and enables consistent querying across different log sources when following ECS standards.

What is the ELK format?

The ELK format refers to structured JSON documents stored in Elasticsearch after transformation through Logstash pipelines. Raw logs are parsed into field-based JSON with standardized field names, enabling powerful filtering, aggregation, and visualization through Kibana’s interface.

What is the standard log format?

The standard log format for ELK deployments is Elastic Common Schema (ECS) compliant JSON. ECS defines normalized field names like host.name, user.name, event.action, and http.request.method, ensuring consistency across applications and simplifying dashboard creation and cross-source queries.

What is an ELK log?

An ELK log is a structured JSON document stored in Elasticsearch after processing through the Logstash pipeline. It contains parsed fields extracted from raw log text, indexed for rapid search and retrieval, and accessible through Kibana for visualization and analysis.

How does Logstash transform raw logs into structured format?

Logstash transforms raw logs using three-stage pipelines with input, filter, and output plugins. Input plugins receive data from sources, filter plugins parse and enrich logs using patterns like Grok, and output plugins send structured JSON documents to Elasticsearch for storage.

What are Grok patterns in Logstash?

Grok patterns are regex-based rules that extract structured data from unstructured text logs in Logstash. They use syntax like %{PATTERN:field_name} to match log components and assign them to named fields, transforming raw Apache or Nginx logs into searchable JSON documents.
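For illustration, here is a minimal Grok filter for a classic syslog line, built only from stock patterns. The field names (hostname, program, pid, log_message) are choices made for this sketch, and the :int suffix asks Grok to store the value as an integer rather than a string.

```
filter {
  grok {
    match => {
      # Example input: Jan 15 14:23:55 web01 sshd[1234]: Accepted publickey for deploy
      "message" => "%{SYSLOGTIMESTAMP:timestamp} %{SYSLOGHOST:hostname} %{DATA:program}(?:\[%{POSINT:pid:int}\])?: %{GREEDYDATA:log_message}"
    }
  }
}
```

Against the sample line in the comment, this should yield timestamp "Jan 15 14:23:55", hostname "web01", program "sshd", pid 1234, and the rest of the line in log_message. Events that do not match are tagged _grokparsefailure, which is worth filtering on in Kibana while you tune the pattern.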

Why use JSON logging instead of plain text?

JSON logging bypasses heavy parsing in Logstash, reduces pipeline complexity, and maps directly to Elasticsearch’s document format. Logging libraries like zap or winston emit JSON out of the box, and Elastic’s ECS formatters for them add ECS field names, which improves pipeline performance and keeps field naming consistent from the start.
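When the application already writes one JSON object per line, the Logstash side can shrink to a single filter. This is a sketch that assumes the raw line arrives in the message field.

```
filter {
  json {
    # Field holding the raw JSON string
    source => "message"
    # Pass non-JSON lines through untouched instead of flagging a parse failure
    skip_on_invalid_json => true
  }
}
```

Decoded keys land at the top level of the event, so a key such as service.name in the application's output becomes a queryable field without any pattern matching.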

What does Filebeat do in the ELK stack?

Filebeat is a lightweight log shipper that collects log files from servers and containers, then forwards them to Logstash or Elasticsearch. It supports Docker autodiscovery using container labels, handles multiline events, and ensures reliable delivery with backpressure handling.

How do you handle multi-line log entries like stack traces?

Multi-line log entries are handled with the multiline settings in Filebeat or the multiline codec in Logstash, both of which rely on pattern matching. Define a pattern that identifies where an event starts (like a timestamp) or how it continues (lines starting with whitespace), so Java exceptions or Python tracebacks are merged into single events. When Filebeat is the shipper, do the merging in Filebeat, before the lines reach Logstash.
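A minimal sketch of the Logstash variant, assuming every new event begins with an ISO8601 timestamp and the file path shown is hypothetical. Note that the multiline codec is meant for single-stream inputs such as file; it should not be combined with the beats input.

```
input {
  file {
    path => "/var/log/app/app.log"   # hypothetical path
    codec => multiline {
      # Any line that does NOT start with an ISO8601 timestamp...
      pattern => "^%{TIMESTAMP_ISO8601}"
      negate => true
      # ...is appended to the previous event
      what => "previous"
    }
  }
}
```

With this in place, a stack trace's "at ..." lines and "Caused by" lines fold into the event that opened with the timestamp.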

What is the difference between text and keyword fields in Elasticsearch?

Text fields are analyzed for full-text search with tokenization and lowercase conversion, while keyword fields enable exact matching for filtering, aggregations, and sorting. Use text for message content and keyword for identifiers like service names or status codes. IP addresses are better served by the dedicated ip field type, which supports CIDR range queries.

Why is timestamp parsing important in log formatting?

Timestamp parsing is crucial for chronological troubleshooting, time-series analysis, and event correlation across distributed systems. Logstash date filters convert various timestamp formats into the standardized @timestamp field, and using UTC prevents timezone-related correlation errors in global deployments.
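A short sketch of such a date filter, assuming the parsed time sits in a field named timestamp and arrives in either ISO8601 or Apache style:

```
filter {
  date {
    # Try each format in order until one parses the "timestamp" field
    match => [ "timestamp", "ISO8601", "dd/MMM/yyyy:HH:mm:ss Z" ]
    # Used only when the source string carries no offset of its own
    timezone => "UTC"
    # Write the result to the event's canonical time field
    target => "@timestamp"
  }
}
```

If none of the formats match, the event is tagged _dateparsefailure and @timestamp keeps the time Logstash received the event, which is a common cause of logs showing up at the wrong point on a Kibana timeline.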

How do you test Grok patterns before production deployment?

Test Grok patterns using Kibana’s Grok Debugger tool in Dev Tools, which validates patterns against sample log entries with real-time field extraction feedback. Use the rubydebug codec to output parsed events to stdout during development, inspecting field names and values before production deployment.
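For the stdout approach, a throwaway pipeline along these lines is usually enough. The file name grok-test.conf is just an example; swap in whichever pattern you are testing.

```
input {
  # Paste sample log lines into the terminal
  stdin { }
}

filter {
  grok {
    match => { "message" => "%{COMBINEDAPACHELOG}" }
  }
}

output {
  # Pretty-print every parsed event, including tags such as _grokparsefailure
  stdout {
    codec => rubydebug
  }
}
```

Run it with bin/logstash -f grok-test.conf, paste a few representative lines, and check the printed field names and types before wiring the filter into the real pipeline.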

What impacts log processing performance in Logstash?

Log processing performance is impacted by parsing complexity, pipeline.workers (parallelism), pipeline.batch.size (events per batch), and heap memory allocation. Complex Grok patterns reduce throughput compared to simpler methods like dissect or native JSON parsing in high-volume scenarios.
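To make the Grok versus dissect comparison concrete, here is a sketch of a dissect filter for a space-delimited application log. The layout (timestamp, level, service, then free text) is an assumed format for illustration.

```
filter {
  dissect {
    mapping => {
      # Example input: 2024-01-15T14:23:55Z INFO checkout Order created
      # Splits on the literal spaces; the last field takes the rest of the line
      "message" => "%{timestamp} %{level} %{service} %{log_message}"
    }
  }
}
```

Because dissect splits on fixed delimiters rather than evaluating regular expressions, it tends to be cheaper per event, but it only works when every line follows the same layout.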

What fields should every log entry include?

Every log entry should include timestamp with timezone, severity level (ERROR, WARN, INFO, DEBUG), human-readable message, service or application name, host identifier, and correlation IDs for distributed tracing. These fields enable effective troubleshooting, filtering, and cross-service correlation in production environments.

How do index templates work in Elasticsearch?

Index templates automatically apply field mappings to new indices matching a pattern like logs-* or filebeat-*, ensuring consistent field types across time-based indices. They define how fields are indexed and searched before documents arrive, preventing mapping conflicts and supporting proper aggregations.

curtisharmon
Curtis has spent over two decades guiding hunters and anglers through the backcountry of Montana and Wyoming. His expertise in elk hunting and fly fishing has made him a sought-after voice in the outdoor community. Curtis combines traditional woodsmanship with modern techniques to help readers succeed in the field.
