Improving reliability of Microsoft NPS as an authentication provider for eduroam

1 August 2014 at 11:41am

I've spent a fair bit of time over the past month trying to improve the reliability of our RADIUS service for eduroam.  Previously it was entirely based on Microsoft NPS, which tends to silently discard authentication packets that it should really be rejecting. This creates a problem: if the authentication request originated from outside your network (i.e. a roaming user authenticating from a remote organisation), the RADIUS server will appear non-responsive as far as the JANET NRPS is concerned and will be marked as offline for 300s.  Once your RADIUS servers are marked as offline, legitimate requests will not be sent to them and will fail.  As much as I would have liked to replace the NPS servers with FreeRADIUS altogether, they integrate with some other services we provide that were written for NPS, so that wasn't possible.

I spent a fair bit of time trying to work out why our NPS servers were discarding packets at all.  In case it's of any use to anyone, I made a few discoveries and subsequent changes to our RADIUS infrastructure to try and alleviate the problems, as described below:

1.  NPS can discard RADIUS authentication requests if they contain invalid attributes.  Whether it rejects or silently discards a request seems to depend on how NPS decides that the request is invalid.  Since NPS has no attribute filtering ability, I ultimately decided to create two new FreeRADIUS servers which function purely as proxy servers between our NPS servers and the JANET NRPS.  On inbound requests I filtered out all but the following attributes (remembering that NPS doesn't recognise Operator-Name):

        User-Name =* ANY,
        Reply-Message =* ANY,
        State =* ANY,
        Class =* ANY,
        Message-Authenticator =* ANY,
        Proxy-State =* ANY,
        EAP-Message =* ANY,
        MS-MPPE-Send-Key =* ANY,
        MS-MPPE-Recv-Key =* ANY,
        Calling-Station-ID =* ANY,
        Chargeable-User-Identity =* ANY,
        NAS-IP-Address =* ANY,
        Framed-MTU >= 576,
        NAS-Identifier =* ANY

We're being a good eduroam institution and we also filter outbound RADIUS messages to only contain the following attributes:

        User-Name =* ANY,
        Reply-Message =* ANY,
        State =* ANY,
        Class =* ANY,
        Message-Authenticator =* ANY,
        Proxy-State =* ANY,
        EAP-Message =* ANY,
        MS-MPPE-Send-Key =* ANY,
        MS-MPPE-Recv-Key =* ANY,
        Calling-Station-ID =* ANY,
        Operator-Name =* ANY,
        Chargeable-User-Identity =* ANY,
        NAS-IP-Address =* ANY,
        Framed-MTU >= 576,
        NAS-Identifier =* ANY

This also prevents stray VLAN attributes from reaching our NPS servers.  These were causing us problems when we were visited by users whose home organisations sent Access-Accept messages containing VLAN information that is only valid at the user's home site.

The FreeRADIUS proxy servers also allow us to inject the Operator-Name attribute into outbound RADIUS packets.  This was not possible with NPS either.
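As a rough illustration of what the allow-list filtering and Operator-Name injection amount to, here is a minimal Python sketch. It is a model of the logic only, not FreeRADIUS code: the function names, the dictionary representation of a packet and the realm `example.ac.uk` are all assumptions made for the example. The `1` prefix on the Operator-Name value is the realm namespace identifier from RFC 5580.

```python
# Illustrative model only: FreeRADIUS does the real filtering. This just
# shows the allow-list logic using the attribute lists given above.

ANY = None  # marker: any value is acceptable (like "=* ANY")

# Attributes allowed through towards NPS. A callable is a value check.
INBOUND_ALLOWED = {
    "User-Name": ANY,
    "Reply-Message": ANY,
    "State": ANY,
    "Class": ANY,
    "Message-Authenticator": ANY,
    "Proxy-State": ANY,
    "EAP-Message": ANY,
    "MS-MPPE-Send-Key": ANY,
    "MS-MPPE-Recv-Key": ANY,
    "Calling-Station-ID": ANY,
    "Chargeable-User-Identity": ANY,
    "NAS-IP-Address": ANY,
    "Framed-MTU": lambda value: int(value) >= 576,
    "NAS-Identifier": ANY,
}

# The outbound list is the same plus Operator-Name.
OUTBOUND_ALLOWED = {**INBOUND_ALLOWED, "Operator-Name": ANY}


def filter_attributes(packet, allowed):
    """Return a copy of packet keeping only attributes on the allow-list."""
    kept = {}
    for name, value in packet.items():
        if name not in allowed:
            continue  # e.g. stray VLAN (Tunnel-*) attributes are dropped
        check = allowed[name]
        if check is ANY or check(value):
            kept[name] = value
    return kept


def add_operator_name(packet, realm):
    """Return a copy of packet with Operator-Name set if it is missing."""
    out = dict(packet)
    out.setdefault("Operator-Name", "1" + realm)  # "1" = realm namespace
    return out
```

For example, a reply carrying `Tunnel-Private-Group-Id` from a visitor's home site would come out of `filter_attributes(reply, INBOUND_ALLOWED)` without that attribute, so NPS never sees it.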

2.  Unfortunately we were still seeing many silent discards on our NPS servers which were causing timeouts on the JANET NRPS.  As a temporary solution I resorted to configuring FreeRADIUS to send back access-rejects if our NPS servers did not respond within 8s (since the NRPS timeout seems to be 10s).  In tandem with this I configured FreeRADIUS to send authentication requests to test the availability of the NPS servers as soon as they were found to be non-responsive (zombies in FreeRADIUS speak).  This ensured I didn't break failover of our NPS servers.
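The idea behind the 8s fallback can be sketched in Python as below. This is only a model of the behaviour described, under the assumption that the proxy must always answer before the upstream 10s timeout; it is not how FreeRADIUS is configured or implemented, and `proxy_with_deadline` and `simulate_home_server` are hypothetical names invented for the example.

```python
import concurrent.futures
import time


def simulate_home_server(delay_s, code="Access-Accept"):
    """Build a stand-in home server that replies after delay_s seconds."""
    def handler(request):
        time.sleep(delay_s)
        return {"code": code}
    return handler


def proxy_with_deadline(send_to_home_server, request, deadline_s=8.0):
    """Forward a request; answer Access-Reject if no reply by deadline_s.

    Replying with a reject (rather than staying silent) means the upstream
    proxy always sees a response inside its own, longer timeout.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(send_to_home_server, request)
    try:
        return future.result(timeout=deadline_s)
    except concurrent.futures.TimeoutError:
        return {"code": "Access-Reject", "reason": "home server timeout"}
    finally:
        pool.shutdown(wait=False)  # do not block on the late reply
```

With a responsive server, `proxy_with_deadline(simulate_home_server(0.01), {}, deadline_s=0.2)` passes the real answer through; with a server slower than the deadline the caller gets an Access-Reject instead of silence. The separate availability checks mentioned above are what stop a slow server from being treated as permanently dead.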

3.  Next I focused on trying to fix the discard errors on NPS.  This was not so easy.  Firstly, we were seeing events discarded with the error message "Authentication failed due to an EAP session timeout; the EAP session with the access client was incomplete".  These mainly occurred when the wireless device was connected with very poor signal strength, which prevented the full EAP exchange from completing.  The solution to this problem was to disable the lowest data rates available to clients connecting to eduroam on our access points.  I found that disabling the 1.0 and 2.0 Mbps data rates was sufficient to clear up 99% of the failures from internal devices.  Most of the events that remain relate to connections at other organisations, so there's nothing I can do about those.

4.  Most of the other discards had the error message "An internal error occurred. Check the system event log for additional information." (don't you just hate those, especially when the system event log had nothing logged!).  It turns out that almost all of these were caused by some Apple and Android devices connecting with bad passwords.  We had the number of re-authentication attempts set to 3, and with this configuration in place a device would offer the user the option of changing their password.  If they entered the correct password it would still fail to connect because the NPS servers would discard the events.  It looks like the requests were discarded because the "SequenceID" in the RADIUS packet hadn't been incremented.  Not all devices behave that way and the only workaround I've found so far is to change the re-authentication attempts to 0.  It's not an ideal solution, but at least it stops the NPS server discarding these events as it starts rejecting instead.

I believe we do still get a small number of discards each day, and these I'll look into as they arise.

5.  To try and improve the performance and availability of our NPS servers I've made a couple of other registry changes, as follows:

Set the EventLogging value under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\SecurityProviders\SCHANNEL to 3.  This increases SCHANNEL logging.

Set HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\Netlogon\Parameters\MaxConcurrentApi to 5.  This increases the number of authentication threads available to the DCs, which matters because our RADIUS servers are not Domain Controllers.

According to the JANET NRPS logs, it seems that the only remaining errors we have relate to users with typos in their username or with bad or expired passwords.  To try to reduce the problem we're producing daily RADIUS failure reports, which are now being reviewed by our helpdesk staff so they can contact the worst offenders.  Of course, it's impossible to remove all these failures, but anything is better than nothing.
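A daily failure report of this kind could be produced along the following lines. This is a hedged sketch, not the report described above: the record layout (dictionaries with `user` and `result` keys) and the function name `worst_offenders` are assumptions, and real RADIUS logs would first need parsing into that shape.

```python
from collections import Counter


def worst_offenders(records, top_n=10):
    """Count rejected authentications per user, most frequent first.

    records is an iterable of dicts with hypothetical keys:
      "user"   - the User-Name from the request
      "result" - "accept" or "reject"
    """
    failures = Counter(
        rec["user"].strip().lower()  # fold case so one user is one row
        for rec in records
        if rec["result"] == "reject"
    )
    return failures.most_common(top_n)
```

Called on a day's worth of parsed records, it returns a list of `(user, failure_count)` pairs that could be handed to helpdesk staff.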

Regards

Simon

Comments

Something I've found out recently is that by default NPS uses a maximum size of 2000 bytes for its datagrams. If it's sending a certificate (whose size will probably exceed this) the packet involved will therefore be fragmented, and sometimes this will cause the packet to be lost in transit, so the EAP interaction is never completed. This causes NPS to log a discard, for some reason claiming that the incoming packet was incorrectly formatted. Even if the client sends a Framed-MTU attribute itself, NPS will ignore it. However, if you set the Framed-MTU attribute in the network policy involved, NPS will use the value you specify for its own packets.

We were seeing this problem initially with responses to the test authentication requests that JANET sends every few minutes. I got the details from a Cisco article (http://www.cisco.com/c/en/us/support/docs/lan-switching/8021x/118634-technote-eap-00.html). Since implementing the change, I've noticed that authentication attempts from a number of clients which were failing previously are now working.
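To see why a 2000-byte datagram ends up fragmented, the arithmetic can be sketched as follows. The sketch assumes IPv4 with no options, a standard 1500-byte Ethernet MTU on the path, and that the quoted size is the UDP payload; the 1344 used in the usage note is purely an illustrative smaller size. Framed-MTU constrains the EAP payload rather than the whole datagram, so real packet sizes will differ somewhat.

```python
import math

IPV4_HEADER = 20  # bytes, assuming no IP options
UDP_HEADER = 8    # bytes


def ipv4_fragments(udp_payload_bytes, link_mtu=1500):
    """Number of IPv4 fragments needed to carry one UDP datagram."""
    datagram = udp_payload_bytes + UDP_HEADER
    # Every fragment except the last carries a multiple of 8 bytes.
    per_fragment = (link_mtu - IPV4_HEADER) // 8 * 8
    return math.ceil(datagram / per_fragment)
```

Under these assumptions `ipv4_fragments(2000)` gives 2, while `ipv4_fragments(1344)` gives 1, so a smaller packet avoids fragmentation on a 1500-byte path and there is no second fragment to lose.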