Palo Alto Breaks FQDN NATs with PAN-OS 9.x

First, I want to say that I love my Palo Alto firewall, a PA-220.  It is a tiny box with a ton of features, and owning a "lab" unit is perfect when you are on a budget.  The lab subscription is very inexpensive annually and gives me the ability to play with all those cool new features I should not try in production on the company firewall at work.  Now to the meat...

Introduction

I have been running this PA-220 for a couple of years now, and PAN-OS 9.0 was recently released.  One great thing Palo Alto did in 9.0 is introduce an FQDN Refresh Enhancement feature, which expires and refreshes cached DNS entries based on their individual TTL values.  Some of you might think this is the way it should have worked from the beginning, but keep in mind that the firewall is not your DNS resolver -- the DNS lookups the firewall needs to perform for itself are few and far between.  However, when you have a SOHO setup where the dynamic IP the ISP gives you works just fine, you still need some way of remotely managing that location and accessing the services hosted there.  This is where dynamic DNS comes into play.
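
As an aside, the dynamic DNS piece does not have to be fancy.  Many providers accept the common dyndns2-style update URL, so a small cron job along the lines of the sketch below is usually enough to keep the A record pointed at the current WAN IP.  The credentials and update URL here are placeholders, not my actual setup:
#!/bin/sh
# Hypothetical dyndns2-style updater; substitute your provider's update URL and credentials
CURRENT_IP=$(curl -s https://icanhazip.com)
curl -s -u "ddns-user:ddns-password" \
  "https://ddns.example.com/nic/update?hostname=hossy-nas01.mydomain.tld&myip=${CURRENT_IP}"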

Background/Use Case

First, let us look at what I am even talking about here starting with a picture.

NAT Policies using an FQDN as the Destination Address

The above picture is a snippet from my own personal firewall showing how I expose access to my home surveillance system.  Here, the Destination Address is an Address Object of type FQDN instead of IP Netmask.  The Security policy uses the same Destination Address as well.  The FQDN points to a DNS A record that is updated by a system running inside my house to match the external IP provided by my ISP.  Simple enough.  All of this was working fine until I upgraded to PAN-OS 9.x.
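
For readers who prefer the CLI, the configuration behind that screenshot looks roughly like the set-format sketch below.  The object and service names are placeholders I am using for illustration (the port matches the anonymized examples later in this post), and the Security policy simply references the same FQDN address object:
admin@Hossy_PA-220# set address hossy-nas01-fqdn fqdn hossy-nas01.mydomain.tld
admin@Hossy_PA-220# set service tcp-789 protocol tcp port 789
admin@Hossy_PA-220# set rulebase nat rules "Alibi CMS HTTP" nat-type ipv4 from untrust to untrust source any destination hossy-nas01-fqdn service tcp-789 destination-translation translated-address 192.168.1.2 translated-port 80
admin@Hossy_PA-220# commit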

NOTE: PAN-OS 9.0.3 is the latest release available at the time of this blog post, and the issue is still present in it.  I have been working with Palo Alto to resolve this, and I encourage anyone experiencing the same behavior to open a support case to troubleshoot and expand Palo Alto's awareness of how many users are affected.

Symptoms

When the NAT policy does not have the correct FQDN information, the configured NATs simply do not work.  You cannot access the services you configured because incoming traffic no longer matches the stale destination address held in the running NAT policy.  Any NAT policies using an IP Netmask work fine.

Cause/Identification

While the exact cause is still unknown, I can tell you what to look for and break down the basics.  You will need to connect to your firewall using SSH.

Like any other DNS client, the firewall has its own DNS cache.  The first thing to check is if the DNS client in the firewall has the correct DNS information about your FQDN by running show dns-proxy fqdn all.
admin@Hossy_PA-220> show dns-proxy fqdn all

FQDN Table : Request time 2019-07-24 13:13:56
--------------------------------------------------------------------------------
        IP Address
--------------------------------------------------------------------------------

VSYS : (using mgmt-obj dnsproxy object)
        Shared
        vsys1

hossy-nas01.mydomain.tld
        12.34.56.78
        ::  unknown
If 12.34.56.78 is your current external IP, then we know the DNS cache in the firewall is correct.  This is rarely a problem, but if it is wrong, it could indicate an upstream caching issue or a TTL that is too high.
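
If the cached entry does look wrong, comparing it against a public resolver from any workstation can help you tell whether the stale answer is coming from upstream DNS or from the firewall itself.  The answer section also shows the record's remaining TTL (8.8.8.8 is just an example resolver):
$ dig +noall +answer hossy-nas01.mydomain.tld A
$ dig +noall +answer hossy-nas01.mydomain.tld A @8.8.8.8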

Now that we know the firewall is correctly resolving your FQDN, we can inspect the running NAT policy to see if the IP has been updated there by running show running nat-policy.  This is usually where the disconnect is.

admin@Hossy_PA-220> show running nat-policy

"Alibi CMS RTSP; index: 14" {
        nat-type ipv4;
        from untrust;
        source any;
        to untrust;
        to-interface  ;
        destination 0.0.0.0;
        service 0:tcp/any/123;
        translate-to "dst: 192.168.1.2:1050";
        terminal no;
}

"Alibi CMS Server; index: 15" {
        nat-type ipv4;
        from untrust;
        source any;
        to untrust;
        to-interface  ;
        destination 0.0.0.0;
        service 0:tcp/any/456;
        translate-to "dst: 192.168.1.2:8000";
        terminal no;
}

"Alibi CMS HTTP; index: 16" {
        nat-type ipv4;
        from untrust;
        source any;
        to untrust;
        to-interface  ;
        destination 0.0.0.0;
        service 0:tcp/any/789;
        translate-to "dst: 192.168.1.2:80";
        terminal no;
}

Using this example output, we can see that 0.0.0.0 does not match 12.34.56.78.  In my experience, the firewall will show 0.0.0.0 if the NAT policy engine has no previous resolution of the FQDN being used, whereas if your external IP has recently changed, the NAT policy engine will show the old IP.

A quick way to determine the impact of this issue on your firewall is by running debug device-server dump fqdn type policy.
admin@Hossy_PA-220> debug device-server dump fqdn type policy


(VSYS1) hossy-nas01.mydomain.tld
---------------------
        12.34.56.78


        SECURITY
        dst : Alibi CMS(6), Alibi CMS RTSP(7),

        NAT
        dst : Alibi CMS RTSP(14), Alibi CMS Server(15), Alibi CMS HTTP(16)

        QOS
        dst : 
This command shows all the Security, NAT, and QoS policies that are using a given FQDN.  While it does not help you fix the problem, it can tell you what will be impacted if you encounter the problem.  One thing to note here is that the IP reported in this command is coming from the dns-proxy and not the NAT policy engine.

Workaround

So, how do you fix it?  Well, manually, of course.  One way is to perform a blank commit on the firewall.  Generating a tech support file also works.  However, the easiest and quickest way I have found is to tell the firewall to refresh all address objects by running debug device-server trigger AddrObjRefresh.
admin@Hossy_PA-220> debug device-server trigger AddrObjRefresh

 AddrObjRefresh Job is scheduled


admin@Hossy_PA-220> show jobs all

Enqueued              Dequeued           ID  PositionInQ                              Type                         Status Result Completed
------------------------------------------------------------------------------------------------------------------------------------------
2019/07/24 08:55:36   08:55:36           87                                 AddrObjRefresh                            ACT   PEND        0%
2019/07/24 08:34:28   08:34:28           86                                       WildFire                            FIN     OK 08:34:57
2019/07/24 08:34:08   08:34:08           85                                         Downld                            FIN     OK 08:34:12

admin@Hossy_PA-220> show jobs id 87

Enqueued              Dequeued           ID                              Type                         Status Result Completed
------------------------------------------------------------------------------------------------------------------------------
2019/07/24 08:55:36   08:55:36           87                    AddrObjRefresh                            ACT   PEND        0%
Warnings:
Details:


admin@Hossy_PA-220> show jobs id 87

Enqueued              Dequeued           ID                              Type                         Status Result Completed
------------------------------------------------------------------------------------------------------------------------------
2019/07/24 08:55:36   08:55:36           87                    AddrObjRefresh                            FIN     OK 08:56:32
Warnings:
Details:Fqdn Refresh job successful
The job takes about a minute to complete and, based on my experience, whether the job reports success or failure is irrelevant.  Even in the case where the job failed, the NAT policy engine was updated and the NATs started working again.
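
Since my external IP can change at any time, I have toyed with automating this workaround.  The PAN-OS XML API accepts operational commands, so a cron job along the lines of the sketch below should do the trick.  Note that I have not verified the exact XML for this particular debug command; the cmd element is my guess based on how CLI commands normally translate to the API (running debug cli on will show you the real mapping), and the firewall hostname and APIKEY are placeholders:
# Hypothetical cron job: trigger the address object refresh through the XML API
curl -skG "https://firewall.mydomain.tld/api/" \
  --data-urlencode "type=op" \
  --data-urlencode "key=APIKEY" \
  --data-urlencode "cmd=<debug><device-server><trigger>AddrObjRefresh</trigger></device-server></debug>"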

Reproduction

Now that we have a solid workaround to the problem, how do we reproduce the issue on demand in order to troubleshoot it further?  This has been the most daunting task because, until recently, I thought the only way to reproduce it was to perform a PAN-OS upgrade.  On a PA-220, that takes quite a bit of time.

The easiest way to reproduce this issue is to reboot the firewall.  I first noticed it while performing a PAN-OS upgrade to 9.0.1, and subsequent troubleshooting has shown me that simply rebooting the firewall also triggers the issue.  However, my most recent troubleshooting has yielded a faster method, especially if you have a lab environment where you can control the DHCP server and, specifically, the lease time.
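
For reference, the reboot can be kicked off from the CLI as well as from the GUI:
admin@Hossy_PA-220> request restart system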

I believe the overall idea is to cause the IP behind the FQDN to change at runtime without performing a commit on the firewall.  With a low TTL, the dns-proxy will update the IP in its database, but the NAT policy engine will only have the IP from the last time a commit was done (or 0.0.0.0 in the case of a reboot).  The Minimum FQDN Refresh Time and FQDN Stale Entry Timeout options (link) introduced in 9.0 do not seem to have any impact on updating the NAT policy engine.

The Faster Method

The goal here is to cause the DHCP server to issue the firewall a new IP which then causes the FQDN to be updated with a new IP.  Here is how I did it:
  1. Identified the DHCP lease time on my ISP's DHCP server is 30 minutes.
  2. When the remaining lease time was just over 15 minutes, told the firewall to "release" the IP (the release and renew commands are sketched after this list).
    • Sending the release command to the DHCP server did not return the IP to the pool.  The DHCP server remembered the IP assignment, so I had to wait for the lease to fully expire.
  3. Waited the remaining lease time (~15 minutes), plus a little more.
  4. Told the firewall to "renew" the IP.  This gave me a new IP address.
  5. Waited for my automation to update the IP of the FQDN.
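
If you want to do the release and renew from the CLI, the commands look roughly like this (ethernet1/1 stands in for your untrust/DHCP interface, and the exact syntax may vary slightly between PAN-OS releases):
admin@Hossy_PA-220> request dhcp client release ethernet1/1
admin@Hossy_PA-220> request dhcp client renew ethernet1/1
admin@Hossy_PA-220> show dhcp client state all
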
At this point, I confirmed by running show dns-proxy fqdn all that the firewall knew about the new IP, and by running show running nat-policy that the NAT policy engine had not been updated.  Only after running debug device-server trigger AddrObjRefresh was the NAT policy engine updated, which then caused the NATs to work again.
