jd/citrix-netscaler-14-monitor-ntp


JD be90712097 Citrix NetScaler ADC v14.1 NTP backend health monitor		2026-06-19 04:04:48 +08:00

Citrix NetScaler ADC - NTP Backend Health Monitor

A lightweight USER monitor that lets a Citrix NetScaler ADC v14.1 load balancer make per-backend NTP health decisions — going beyond a simple port check to verify each upstream NTP server is actually synchronized and serving accurate time.

Description

I've created a KISS (Keep It Simple, Stupid) service monitor for Citrix NetScaler ADC v14.1 to validate individual NTP backends behind a load balancer. This monitor performs comprehensive health checks on NTP servers using both system variables and peer table analysis.

Standard load balancer monitors typically only confirm that UDP port 123 is open, which says nothing about whether the server's clock is trustworthy. This monitor digs deeper: it queries each backend with ntpq, evaluates stratum, reach, offset, delay, jitter, and peer selection against configurable thresholds, and returns a clear healthy/unhealthy verdict that NetScaler can act on. It supports both IPv4 and FQDN targets (including multi-A-record resolution) and ships as both a Bash and a Perl implementation.

Features

Core Functionality

Dual Query Method: I query both ntpq -c rv (system variables) and ntpq -pn (peer table) to get comprehensive NTP health metrics
IP and FQDN Support: The script accepts either IPv4 addresses or FQDNs as input
Multiple A Record Handling: For FQDNs with multiple A records, the script can either pass if any one is healthy (default) or require all to pass (--all-a flag)
Robust Resolution: The script tries multiple resolution methods in order: getent hosts, host -t A, then nslookup -type=A

Health Logic

I apply strict health criteria based on NTP best practices:

Metric	Default Threshold	Customizable Range	Description
Reach	≥224 (≥340 octal)	1-255	Numeric floor on the 8-poll reach register; 224 means the three oldest polls in the window succeeded*
Stratum	1-5	1-15	Valid NTP hierarchy level
Delay	≤200ms	1-10000ms	Network latency threshold
Offset	≤500ms	1-10000ms	Time synchronization accuracy
Jitter	≤50ms	1-1000ms	Clock stability measurement
Selected Peers	≥1	N/A	Must have upstream peers (`*`, `+`, or `o` status)

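*Reach Calculation: The reach value is an 8-bit shift register covering the last 8 poll attempts. On each poll the register shifts left and the result is written to the rightmost bit: 1 for a successful poll, 0 for a failed one. ntpq displays the value in octal. Because the threshold is compared numerically, the oldest polls carry the most weight: 224 passes with only three set bits, while 037 octal (31 decimal) fails even though the five most recent polls succeeded. Examples:

255 (11111111₂) = all 8 polls successful
224 (11100000₂) = last 5 polls failed, the 3 before them successful
377₈ (octal, as displayed by ntpq) = 255 decimal = all 8 polls successful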
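The reach arithmetic can be checked by hand. The following is a minimal sketch, assuming the octal value comes from the reach column of ntpq -pn. The function names reach_to_decimal and reach_successes are illustrative and are not taken from the shipped scripts.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; names are illustrative, not from ns_ntp_check.sh.

# Convert the octal reach value printed by ntpq (e.g. 377) to decimal.
# A leading 0 makes shell arithmetic read the digits as octal.
reach_to_decimal() {
    printf '%d\n' "$((0$1))"
}

# Count how many of the last 8 polls succeeded (set bits in the register).
reach_successes() {
    local value=$((0$1)) count=0
    while [ "$value" -gt 0 ]; do
        count=$((count + (value & 1)))
        value=$((value >> 1))
    done
    printf '%d\n' "$count"
}

reach_to_decimal 377   # 255
reach_to_decimal 340   # 224
reach_successes 340    # 3
reach_successes 37     # 5
```

A numeric floor and a success count answer different questions, so it helps to know which one a given threshold is expressing before tuning --min-reach.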

Threshold Customization

All health thresholds can be overridden on a per-server basis using command-line parameters:

--max-stratum <n>: Maximum allowed stratum (default: 5)
--max-delay <n>: Maximum delay in milliseconds (default: 200)
--max-offset <n>: Maximum offset in milliseconds (default: 500)
--max-jitter <n>: Maximum jitter in milliseconds (default: 50)
--min-reach <n>: Minimum reach value (default: 224)

This allows fine-tuning monitoring for specific server requirements or network conditions.
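As a rough illustration of how such overrides can be layered on top of the documented defaults, here is a small option parser. It is a sketch only: the function name parse_thresholds and the variable names are assumptions, and the shipped scripts may parse their arguments differently.

```shell
#!/usr/bin/env bash
# Hypothetical option parser; function and variable names are illustrative.

parse_thresholds() {
    # Defaults as documented above.
    MAX_STRATUM=5 MAX_DELAY=200 MAX_OFFSET=500 MAX_JITTER=50 MIN_REACH=224
    while [ "$#" -gt 0 ]; do
        case "$1" in
            --max-stratum|--max-delay|--max-offset|--max-jitter|--min-reach)
                # Every threshold flag needs a value after it.
                [ "$#" -ge 2 ] || { echo "missing value for $1" >&2; return 1; }
                case "$1" in
                    --max-stratum) MAX_STRATUM=$2 ;;
                    --max-delay)   MAX_DELAY=$2 ;;
                    --max-offset)  MAX_OFFSET=$2 ;;
                    --max-jitter)  MAX_JITTER=$2 ;;
                    --min-reach)   MIN_REACH=$2 ;;
                esac
                shift 2 ;;
            *)
                shift ;;   # positional arguments (target, port) are skipped in this sketch
        esac
    done
}

parse_thresholds 192.168.1.100 123 --max-stratum 3 --max-delay 100
echo "stratum<=$MAX_STRATUM delay<=$MAX_DELAY offset<=$MAX_OFFSET"
```

Resetting the defaults at the top of the function means each monitor invocation starts from the documented values and only the flags that are present change anything.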

Why Both rv and -pn Queries?

I query both because they provide complementary information:

ntpq -c rv: Gives system-wide variables like stratum, offset, sys_jitter, and rootdelay
ntpq -pn: Shows peer-specific metrics including reach status and selection flags

The peer table is critical because it shows whether the NTP server has selected upstream time sources for synchronization. A server might have good system variables but be unhealthy if it hasn't selected any upstream peers due to network issues or configuration problems.
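To make the peer-selection check concrete, the sketch below counts peer lines whose tally code is *, + or o. The helper name count_selected_peers is hypothetical, and the sample text only imitates the layout of ntpq -pn; its addresses and figures are made up.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count peers whose tally code is *, + or o.
# The first two lines of the peer table are the header and the ruler.
count_selected_peers() {
    awk 'NR > 2 && /^[*+o]/ { n++ } END { print n + 0 }'
}

# Sample text in the layout of `ntpq -pn`; addresses and figures are made up.
sample_peers='     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*192.0.2.10      .GPS.            1 u   33   64  377    1.210   -0.045   0.110
+192.0.2.11      192.0.2.1        2 u   12   64  377    2.480    0.310   0.250
-192.0.2.12      192.0.2.1        2 u   40   64  377    9.870    4.120   1.900
 192.0.2.13      .INIT.          16 u    -   64    0    0.000    0.000   0.000'

printf '%s\n' "$sample_peers" | count_selected_peers   # 2
```

A count of zero corresponds to the "No selected peers" failure, whatever the system variables report.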

IP vs FQDN Handling

IPv4 Address Input

When you provide an IPv4 address, the script uses it directly without any resolution.

FQDN Input

When you provide an FQDN, the script resolves it using this fallback chain:

getent hosts <name> (preferred - uses system resolver)
host -t A <name> (fallback)
nslookup -type=A <name> (last resort)
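The fallback chain above could be sketched roughly as follows. The function name resolve_ipv4 and the awk parsing of each tool's output are assumptions made for illustration; the shipped script may extract addresses differently.

```shell
#!/usr/bin/env bash
# Hypothetical resolver; the function name and output parsing are illustrative.
resolve_ipv4() {
    local name=$1 ips=""
    # Dotted-quad input is used as-is, without any lookup.
    if printf '%s\n' "$name" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$'; then
        printf '%s\n' "$name"
        return 0
    fi
    # 1. System resolver; keep only IPv4 addresses from the first column.
    ips=$(getent hosts "$name" 2>/dev/null |
        awk '$1 ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/ { print $1 }')
    # 2. host prints "<name> has address <ip>" for each A record.
    if [ -z "$ips" ]; then
        ips=$(host -t A "$name" 2>/dev/null | awk '/has address/ { print $NF }')
    fi
    # 3. nslookup lists answer addresses after the "Name:" line.
    if [ -z "$ips" ]; then
        ips=$(nslookup -type=A "$name" 2>/dev/null |
            awk '/^Name:/ { seen = 1 } seen && /^Address/ { print $NF }')
    fi
    [ -n "$ips" ] || return 1
    printf '%s\n' "$ips" | sort -u
}

resolve_ipv4 192.168.1.100        # printed unchanged, no lookup performed
# resolve_ipv4 pool.ntp.org       # would print one IPv4 address per line
```

Each later method runs only when the previous one produced no address, so a missing tool simply falls through to the next step.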

Multiple A Records Behavior

Default mode: The script iterates through A records and considers the node healthy if ANY one passes all health checks
--all-a mode: The script requires ALL A records to pass health checks for the node to be considered healthy

This KISS approach works well for most load balancer scenarios where you want to know if at least one backend is available.
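The difference between the two modes comes down to how per-address results are combined. Here is a minimal sketch, assuming a per-address check named check_one (stubbed below) and an aggregator named check_node; both names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical aggregation over resolved addresses. check_one stands in for
# the per-address health check and is stubbed here for illustration.
check_one() {
    [ "$1" != "192.0.2.99" ]   # pretend only 192.0.2.99 is unhealthy
}

# Usage: check_node any|all ip...
check_node() {
    local mode=$1 ip passed=0 failed=0
    shift
    for ip in "$@"; do
        if check_one "$ip"; then
            passed=$((passed + 1))
        else
            failed=$((failed + 1))
        fi
    done
    if [ "$mode" = "all" ]; then
        # Every address must pass, and there must be at least one address.
        [ "$passed" -gt 0 ] && [ "$failed" -eq 0 ]
    else
        # One passing address is enough.
        [ "$passed" -gt 0 ]
    fi
}

check_node any 192.0.2.10 192.0.2.99 && echo "any: healthy"
check_node all 192.0.2.10 192.0.2.99 || echo "all: unhealthy"
```

The sketch checks every address in both modes for clarity; an implementation could return early in the default mode as soon as one address passes.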

Installation on NetScaler ADC

Step 1: Upload Script

# Copy to NetScaler
scp ns_ntp_check.sh nsroot@<netscaler-ip>:/nsconfig/monitors/

# Set proper permissions
ssh nsroot@<netscaler-ip> "chmod 555 /nsconfig/monitors/ns_ntp_check.sh"

Step 2: Create USER Monitor

# Via CLI - Basic monitor (uses default thresholds)
add lb monitor ntp_health_check USER -scriptName ns_ntp_check.sh -interval 300 -retries 3

# Via CLI - With custom stratum threshold
add lb monitor ntp_strict_stratum USER -scriptName ns_ntp_check.sh -scriptArgs "--max-stratum 3" -interval 300 -retries 3

# Via GUI: Traffic Management > Load Balancing > Monitors
# - Type: USER
# - Script Name: ns_ntp_check.sh  
# - Script Args: (leave empty for defaults, or "--max-stratum 3" for custom threshold)
# - Interval: 300 seconds (5 minutes for testing, 1800 for production)

Note: When using Script Args, the NetScaler ADC passes the target IP and port to the script ahead of your custom arguments. For example, if you set Script Args to --max-stratum 3, the actual command executed will be:

ns_ntp_check.sh 192.168.1.100 123 --max-stratum 3

Step 3: Bind to Service Group

bind servicegroup ntp_backends -monitorName ntp_health_check

Testing

Local Testing with nsumon-debug.pl

# Test with IP address (arguments: script, IP, port, timeout in seconds, partition ID)
nsumon-debug.pl ns_ntp_check.sh 192.168.1.100 123 8 0

# Test with FQDN
nsumon-debug.pl ns_ntp_check.sh pool.ntp.org 123 8 0

Direct Script Testing

# Basic usage
./ns_ntp_check.sh 192.168.1.100
./ns_ntp_check.sh pool.ntp.org

# With custom port and timeout
./ns_ntp_check.sh 192.168.1.100 --port 123 --timeout 10

# Require all A records to pass (for FQDNs)
./ns_ntp_check.sh pool.ntp.org --all-a

# Custom thresholds for strict monitoring
./ns_ntp_check.sh 192.168.1.100 --max-stratum 3 --max-delay 100 --max-offset 200

# Relaxed thresholds for distant servers
./ns_ntp_check.sh remote.ntp.org --max-delay 500 --max-offset 1000 --min-reach 192

# Show help
./ns_ntp_check.sh --help

Example Outputs

# Healthy server (exit code 0, status on stderr)
$ ./ns_ntp_check.sh 192.168.1.100
NTP server healthy: stratum=3, offset=-3.76ms, jitter=4.0ms, delay=102.3ms, peers=3, reach=255
$ echo $?
0

# Unhealthy server (exit code 1, reason on stderr)
$ ./ns_ntp_check.sh 192.168.1.200
Stratum out of range: 16 (must be 1-5)
$ echo $?
1

Common Pitfalls and Solutions

NTP Servers Blocking Mode-6 Queries

Problem: Some NTP servers block ntpq mode-6 queries for security reasons.

Solution:

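For Cisco IOS: ntp allow mode control
For chrony: chronyd does not implement the mode-6 control protocol, so ntpq cannot query it; cmdallow only governs chronyc access, and chrony backends need a different check
For ntpd: Ensure no restrict line applies noquery to the NSIP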

NSIP as Source Address

Problem: NTP servers may have ACLs that don't allow queries from the NetScaler NSIP.

Diagnosis: Run the queries by hand from the NetScaler shell to confirm whether the NTP server answers the NSIP.

# Test from NetScaler shell
shell
ntpq -c rv <target-ip>
ntpq -pn <target-ip>

Permanent Fix: Update NTP server ACLs to allow NetScaler NSIP.

chrony vs ntpd Differences

Problem: NTP implementations differ in their output formats, and chronyd does not answer ntpq mode-6 queries at all.

Solution: I've designed the parsing to be robust across implementations that answer ntpq, but test thoroughly with your specific NTP servers.

ntpsec Compatibility

Problem: NTPsec (a security-hardened fork of the reference ntpd) may use different field names.

Solution: The script handles both sys_jitter and clk_jitter fields for compatibility.
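One way to express that fallback is sketched below. The helper name rv_get is hypothetical, and the sample text only imitates the layout of ntpq -c rv with made-up values; sys_jitter is deliberately left out of the sample so that the fallback path runs.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the value of one variable from `ntpq -c rv` text,
# or nothing if the variable is absent.
rv_get() {
    # Split the comma-separated list into lines, then strip "name=".
    tr ',' '\n' | sed -n "s/^[[:space:]]*$1=//p" | head -n 1
}

# Sample in the layout of `ntpq -c rv`; the values are made up.
sample_rv='associd=0 status=0615 leap_none, sync_ntp, 1 event, clock_sync,
stratum=3, precision=-23, rootdelay=102.300, rootdisp=35.100,
offset=-3.760, frequency=12.345, clk_jitter=4.000, clk_wander=0.010'

jitter=$(printf '%s\n' "$sample_rv" | rv_get sys_jitter)
# Fall back to clk_jitter when sys_jitter is not reported.
[ -n "$jitter" ] || jitter=$(printf '%s\n' "$sample_rv" | rv_get clk_jitter)

echo "stratum=$(printf '%s\n' "$sample_rv" | rv_get stratum) jitter=$jitter"
```

Anchoring the match at the start of each split field keeps a lookup for offset from matching a longer name that merely ends in offset.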

Troubleshooting

Enable Temporary Mode-6 Access

If you're getting "Failed to query system variables" errors:

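# On Cisco IOS NTP server
configure terminal
ntp allow mode control
exit

# On Linux ntpd server
# Add to /etc/ntp.conf (no noquery flag, so mode-6 queries from the NSIP are answered):
restrict <netscaler-nsip> nomodify notrap

# Restart ntpd (the unit is named ntp on Debian/Ubuntu and ntpd on RHEL-family systems)
systemctl restart ntp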

Check NetScaler Monitor Logs

# SSH to NetScaler
ssh nsroot@<netscaler-ip>

# Check monitor daemon logs
tail -f /var/nslog/nsumond.log

# Look for your script execution and any errors
grep "ns_ntp_check" /var/nslog/nsumond.log

Debug Script Execution

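# Run the same ntpq queries the script uses
ntpq -c rv <target-ip>
ntpq -pn <target-ip>

# Check whether the target has selected any upstream peers (tally code *, + or o)
ntpq -pn <target-ip> | grep '^[*+o]'

# Verify network connectivity (NTP uses UDP, so a TCP tool such as telnet cannot test port 123)
ping <target-ip>
# Query-only time request, if ntpdate is installed
ntpdate -q <target-ip>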

Common Error Messages

"Failed to query system variables": Network connectivity or mode-6 blocking
"No selected peers - server not synchronized": Server has no upstream time sources
"Selected peer reach too low": Intermittent connectivity to upstream servers
"Stratum out of range": Server not synchronized or configured incorrectly
"Offset too high": Time difference exceeds acceptable threshold
"Jitter too high": Clock instability detected
"Delay too high": Network latency to upstream servers excessive
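The threshold failures in this list map onto simple comparisons. The sketch below shows one way to make them with decimal values, which the shell's integer arithmetic cannot compare on its own. The function names within_limit and evaluate are hypothetical, and apart from the "Stratum out of range" line shown in the example output earlier, the exact message wording is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical checks; function names and message details are illustrative.

# Succeeds when |value| <= limit. awk handles the decimals that the shell's
# integer arithmetic cannot.
within_limit() {
    awk -v v="$1" -v max="$2" 'BEGIN { if (v < 0) v = -v; exit !(v <= max) }'
}

# Usage: evaluate stratum offset_ms jitter_ms max_stratum max_offset max_jitter
evaluate() {
    if [ "$1" -lt 1 ] || [ "$1" -gt "$4" ]; then
        echo "Stratum out of range: $1 (must be 1-$4)" >&2
        return 1
    fi
    within_limit "$2" "$5" || { echo "Offset too high: ${2}ms" >&2; return 1; }
    within_limit "$3" "$6" || { echo "Jitter too high: ${3}ms" >&2; return 1; }
    return 0
}

evaluate 3 -3.76 4.0 5 500 50 && echo healthy
evaluate 16 0.0 0.0 5 500 50 || echo unhealthy
```

Taking the absolute value before comparing means a clock that is 600 ms behind fails the offset check just as one that is 600 ms ahead does.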

Script Variants

I've provided both Bash and Perl implementations:

ns_ntp_check.sh: Primary Bash script (recommended)
ns_ntp_check.pl: Perl variant with identical functionality

Both scripts have identical command-line interfaces and health logic. Choose based on your preference and environment constraints.

Performance Characteristics

Execution time: Typically 1-3 seconds per IP address
Memory usage: Minimal (< 1MB)
Network overhead: Two small UDP queries per target IP
Timeout handling: Configurable with --timeout flag (default 8 seconds)

The KISS design ensures reliable operation even under high load on the NetScaler appliance.