Status: The Present

Server time on pageload: 2010-09-02 08:02:27 -0700
 

Status: The Future

There are currently no scheduled maintenance windows.

Status: The Past

 

Thursday, Sep 2nd

2010-09-02 2:03am - 2:10am -07:00 (PST) / 2010-09-02 09:03 - 2010-09-02 09:10 +00:00 (GMT)

APP Connectivity Issues due to Backups

Impacted Systems: app, admin

We just experienced an issue impacting customer sites. We are exploring further and will update here when more info is available.

Update: It appears as though a scheduled backup impacted performance. We will be exploring to ensure this does not happen again.

 

Monday, Aug 30th

2010-08-30 8:30pm - 11:59pm -07:00 (PST) / 2010-08-31 03:30 - 2010-08-31 06:59 +00:00 (GMT)

Regular Maintenance to free up space on Database Server

Impacted Systems:

We’ll be conducting some regular maintenance tonight to free up additional disk space on our primary database server. We anticipate no down time.

 

Thursday, Aug 26th

2010-08-26 12:32pm - 12:39pm -07:00 (PST) / 2010-08-26 19:32 - 2010-08-26 19:39 +00:00 (GMT)

Downtime on WWW, Forum

Impacted Systems: www, forum, affiliate

We just experienced 6 minutes of downtime on our WWW, FORUM, and AFFILIATE systems due to a disk issue. No data was lost, and this did not impact APP or ADMIN at all. We apologize for the inconvenience.

 

Tuesday, Aug 24th

2010-08-24 9:08pm - 9:14pm -07:00 (PST) / 2010-08-25 04:08 - 2010-08-25 04:14 +00:00 (GMT)

Unexpected Downtime for APP

Impacted Systems: app

We experienced about 6 minutes of downtime because an administrative database query took too long to respond, which blocked other queries and caused the app to stop responding. These types of queries will only be run on backups in the future to avoid any possible locks on the production system.

 

Tuesday, Aug 24th

2010-08-24 8:33am - 8:38am -07:00 (PST) / 2010-08-24 15:33 - 2010-08-24 15:38 +00:00 (GMT)

Unexpected slowdown on APP

Impacted Systems: app, admin

Our primary app server experienced an unexpected slowdown. We served traffic during the affected window, but at a much lower rate than normal. Please be patient; we are investigating the cause in order to prevent a recurrence.

 

Friday, Jul 30th

2010-07-30 6:56pm - 6:58pm -07:00 (PST) / 2010-07-31 01:56 - 2010-07-31 01:58 +00:00 (GMT)

foxycart_includes.js temporarily unavailable

Impacted Systems: app

Also due to code changes related to upgrading our billing system, the foxycart_includes.js file, which contains jQuery, was unavailable for two minutes. If your site required jQuery to function correctly and you were using the jQuery included in that file, your customers may have been impacted. This error on our part was due to a limitation in our current QA setup, which did not bring the error to our attention before we deployed the changes to production. We’re working to improve our development systems to avoid these types of errors in the future.

 

Friday, Jul 30th

2010-07-30 4:53pm - 4:54pm -07:00 (PST) / 2010-07-30 23:53 - 2010-07-30 23:54 +00:00 (GMT)

Momentary WWW Downtime

Impacted Systems: www

Due to human error while upgrading the FoxyCart billing system, the WWW site was down for about one minute. This affected only the www.foxycart.com site (i.e., no FoxyCart stores or other systems were impacted).

 

Thursday, Jul 15th

2010-07-15 1:01pm - 1:09pm -07:00 (PST) / 2010-07-15 20:01 - 2010-07-15 20:09 +00:00 (GMT)

High Latency on app3

Impacted Systems: app, admin

Under normal traffic, app3’s response times unexpectedly slowed. Further investigation shows that Apache’s memory management may be at fault, but we have nothing definitive yet. We are continuing to investigate, and we will post here when we know more.

 

Wednesday, Jun 23rd

2010-06-23 2:00am - 2:30am -07:00 (PST) / 2010-06-23 09:00 - 2010-06-23 09:30 +00:00 (GMT)

Replacing APP2 with APP3

Impacted Systems: app, admin

APP2, the APP server responsible for serving the bulk of FoxyCart traffic since May, is believed to have a RAID controller issue that has caused or contributed to the repeated latency and downtimes over the past few weeks. This maintenance window is to replace APP2 with APP3, which has been stress tested and equipped with much faster disks.

Once APP3 is online and serving traffic we can diagnose APP2. After APP2 is diagnosed we will proceed with plans to improve redundancy, which have heretofore been stalled by suspected faulty hardware.

We expect this to be a relatively brief maintenance window, but we are scheduling 30 minutes in case things do not go according to plan. If things don’t go according to plan and APP3 is not brought online, we will leave APP2 as the primary web server until further notice.

UPDATE 02:00 PST: DNS has been flipped to the failover systems.

UPDATE 02:20 PST: DNS has been flipped back. It may take up to 5 minutes for service to completely restore to all users.

 

Tuesday, Jun 22nd

2010-06-22 7:22pm - 7:31pm -07:00 (PST) / 2010-06-23 02:22 - 2010-06-23 02:31 +00:00 (GMT)

Unexpected Downtime on APP (continued)

Impacted Systems: app, admin

The issues we ran into yesterday regarding APP losing connectivity just happened again today. We have been working on the new server but were hindered by what appeared to be faulty hardware but may be related to specific virtual environment setup conflicts. We’re attempting a different setup on the new server and may move forward with an emergency downtime tonight to move off of our failing APP server. Stay tuned.

 

Monday, Jun 21st

2010-06-21 1:28pm - 11:15pm -07:00 (PST) / 2010-06-21 20:28 - 2010-06-22 06:15 +00:00 (GMT)

Unexpected downtime on APP

Impacted Systems: app, admin

We had an issue with Apache memory consumption on APP. The memory consumption ballooned out of control and the server stopped responding to requests. In order to get things under control, we restarted APP and it came back up as expected.

We have adjusted Apache’s memory settings to keep things below the memory limits on the box. Our apologies for the downtime. We’ve tuned our current system the best that we can, and we’re doing our best to bring up a new, bigger server to handle the load.

UPDATE 13:40 PST: This issue appears to be a little deeper than it appeared. We are still working on resolving sporadic latency issues, and will update further as we make progress.

UPDATE 18:32 PST: We are continuing to diagnose the problems we encountered today, and have just resolved a routing issue that impacted a small handful of stores using custom subdomains. (Affected stores will be emailed.)

UPDATE 20:17 PST: We are considering an emergency maintenance window for this evening in order to make a hardware switch. It will be to a system with less power, but the system is known to be working, whereas we have concerns about the disk controller on one of the current production servers. We will update again when we have a maintenance window set.

UPDATE 23:00 PST: We have just obtained another server that we plan to use instead of the less powerful system mentioned above. In light of this new system we have decided to complete testing on that, and replace the ailing production server as soon as it has been thoroughly tested. We will update back here as progress is made.

RECAP FOR 2010-06-21 (23:15 PST): There were four separate incidents of downtime today, and throughout some of the evening FoxyCart responded very slowly to most requests. Further, some stores using custom subdomains were non-responsive after the last downtime. (Our monitoring, which does monitor custom subdomains, did not catch this issue as most custom subdomains were fine. We are addressing this oversight to prevent it from happening in the future.) We are working feverishly on addressing all issues as mentioned above. At present, all systems are fully functional and responding quickly, as they have been for the past 4 hours.

 

Monday, Jun 7th

2010-06-07 11:30pm - 11:35pm -07:00 (PST) / 2010-06-08 06:30 - 2010-06-08 06:35 +00:00 (GMT)

Adding IP addresses to the primary app server

Impacted Systems: app, admin

We are adding a new block of IPs to our primary app server in order to serve more Custom Subdomain customers. We do not anticipate any downtime, and have added IP addresses many times before to the running server without issue. However, given our latest server issues, we wish to be extremely cautious while performing what is normally a very simple and low-impact operation.

Update: 11:55PM PST All IP addresses were added as planned, and there was no downtime during the process. Everything went better than expected!

 

Friday, Jun 4th

2010-06-04 11:30pm - 11:53pm -07:00 (PST) / 2010-06-05 06:30 - 2010-06-05 06:53 +00:00 (GMT)

Network Switch Upgrade

Impacted Systems:

In order to accommodate hardware upgrades and additional servers in our network, it is necessary to upgrade the switch connecting our existing servers. We do not anticipate any perceived downtime, but given the week we have had, we are prepared for the worst and have scheduled a one-hour maintenance window. We will update here as we make progress, and when it is completed.

UPDATE 23:43 PST: Cables are being switched. We turned on DNS failover to prevent any other error messages that may pop up.

UPDATE 23:44 PST: The new switch isn’t lighting up with activity as expected. Replacing now.

UPDATE 23:50 PST: Switch replaced. Connectivity restored. Testing before switching DNS back.

UPDATE 23:53 PST: All systems go. The issue with the first switch was a mismatched power adapter, we are told.

 

Thursday, Jun 3rd

2010-06-03 1:37pm - 8:52pm -07:00 (PST) / 2010-06-03 20:37 - 2010-06-04 03:52 +00:00 (GMT)

Network Issues on Primary APP Server

Impacted Systems: app, admin

We are currently experiencing issues affecting our connectivity. We will update here as we get info.

UPDATE 20:12 PST: We are still working on diagnosing the issue, though we believe we have determined the cause. We also placed an order a few hours ago for a new physical server to add to our cluster in order to provide increased reliability.

We experienced 3 connectivity losses today, including one that impacted custom subdomains. Each incident was between 5-8 minutes long, and custom subdomains may have been impacted for longer.

 

Wednesday, Jun 2nd

2010-06-02 11:30pm - 12:35am -07:00 (PST) / 2010-06-03 06:30 - 2010-06-03 07:35 +00:00 (GMT)

Brief Maintenance Interval to upgrade app server

Impacted Systems: app, admin

It appears that the issues we’ve recently had on our new app server have to do with a disk I/O limitation in the underlying hardware. In order to improve performance and diagnose the problem further, we are adding an additional disk to the app server with greater I/O bandwidth.

UPDATE 23:48 PST: The upgrade did not go as planned. We are working feverishly to restore service.

UPDATE 23:54 PST: The problematic app server has been successfully reverted.

UPDATE 00:11 PST: The problem app server appeared to have an issue with the RAID controller. It is back online and we are diagnosing before turning it back on. We’re very, very sorry for the inconvenience.

UPDATE 00:35 PST: Service has been restored. DNS has a 5 minute TTL so requests may take a moment for DNS cache to clear.

 

Friday, May 28th

2010-05-28 11:32am - 11:38am -07:00 (PST) / 2010-05-28 18:32 - 2010-05-28 18:38 +00:00 (GMT)

Momentary downtime

Impacted Systems:

We just experienced another issue on our primary APP server from 11:32-11:38am PST. We are exploring this issue and will post back with details. We apologize for what has been our worst week ever with regard to these connectivity issues.

UPDATE 12:06 PST: At this point it appears that the issue stemmed from the Linux kernel downgrade we performed this past weekend. IP addresses were added to the server at 11:32, which coincided with the downtime. Typically, adding IPs to a server isn’t problematic, and we have done so many, many times in the past without issue. Unfortunately, we may have hit an underlying limit in the network stack, which caused problems. Instead, we will plan to add the IP addresses tomorrow night with a brief downtime to ensure that all is functioning smoothly.

We sincerely apologize for this error on our part, especially as it happened right in the middle of the day. As always, please let us know if you have comments, questions, or concerns on our forum.

 

Tuesday, May 25th

2010-05-25 6:05am - 9:24am -07:00 (PST) / 2010-05-25 13:05 - 2010-05-25 16:24 +00:00 (GMT)

High Load on app.foxycart.com

Impacted Systems: app, admin

We are currently experiencing high load on app.foxycart.com, our primary app server. We are working hard to determine the cause of the load and reduce or eliminate it to return things to normal. We will update here as soon as we know more.

 

Monday, May 24th

2010-05-24 2:07am - 2:21am -07:00 (PST) / 2010-05-24 09:07 - 2010-05-24 09:21 +00:00 (GMT)

Unexpected Downtime for APP: Related to scheduled maintenance

Impacted Systems:

Related to the scheduled maintenance earlier in the evening, a test script run on a dev webserver wasn’t terminated correctly, leading to an increased load and ultimately a non-responsive database server when nightly maintenance scripts were run. The dev webserver was prepared for the previous window’s kernel downgrade and was tested, but it had insufficient resources to handle the stress test. This caused the server to slow sufficiently that it did not quit the tests when commanded, and it kept the requests generated by the test open for much longer than expected.

This was human error, however, and was not caused by the bug that caused earlier bouts of downtime this weekend.

We will be completing an incident report internally, and will update this thread with additional information as it becomes available.

 

Sunday, May 23rd

2010-05-23 11:00pm - 11:20pm -07:00 (PST) / 2010-05-24 06:00 - 2010-05-24 06:20 +00:00 (GMT)

Scheduled Maintenance for Linux Kernel Downgrade

Impacted Systems:

We will be downgrading the Linux kernel in order to address what appears to be a potential incompatibility with Xen, related to the issue experienced on Thursday.

We do not anticipate any issues with this downgrade, as we have other systems using the specific version very reliably, but in the event something does go wrong we will be ready to revert and come back online immediately. We will update here as it happens, and as soon as the server is successfully back serving traffic.

UPDATE 23:06 PST: Kernel downgrade completed, tested, and DNS switched back to point to production servers instead of failover servers. Depending on when you last made a DNS query DNS may take a few minutes to update.

 

Saturday, May 22nd

2010-05-22 2:27pm - 2:56pm -07:00 (PST) / 2010-05-22 21:27 - 2010-05-22 21:56 +00:00 (GMT)

Unscheduled Maintenance Window on APP and ADMIN

Impacted Systems:

FoxyCart’s primary app server experienced a sudden and very high load, increasing latency and causing timeouts for some requests. We are still exploring the causes of these sudden spikes, and will update again once we have more details.

We are also working on methods to handle sudden traffic spikes in a more graceful manner, including improving load balancing strategies and exploring different web servers.

 

Friday, May 21st

2010-05-21 11:11am - 11:20am -07:00 (PST) / 2010-05-21 18:11 - 2010-05-21 18:20 +00:00 (GMT)

Downtime on APP

Impacted Systems: app, admin

Our primary APP server is currently experiencing unexpected downtime. Please stand by as we resolve the issue and determine the cause.

UPDATE 11:19 PST: APP and ADMIN are back up. We are looking into the cause for the downtime and will update as soon as details are available.

UPDATE 12:21 PST: We have done a preliminary investigation and have failed to determine the cause. We know that one of our virtual machines responsible for serving traffic failed in a catastrophic way. The failure occurred on the recently added physical server. At this point we believe that it’s a software issue, but we are continuing to diagnose.

UPDATE 13:33 PST: We are currently analyzing whether or not the specific version of the Linux kernel in the VM distribution we are using may have kernel issues with our virtualization environment.

 

Monday, May 17th

2010-05-17 11:30pm - 11:40pm -07:00 (PST) / 2010-05-18 06:30 - 2010-05-18 06:40 +00:00 (GMT)

Database Server Memory Upgrade Window

Impacted Systems: app, admin

In order to minimize potential issues with our primary database server we will be allocating more memory to it during this maintenance window. To make this happen a server restart is necessary. Because most of our systems are virtualized, we can accomplish this very quickly, so we do not anticipate more than a few minutes of downtime. If any issues are encountered the systems will be reverted to their initial settings.

 

Monday, May 17th

2010-05-17 5:41pm - 5:51pm -07:00 (PST) / 2010-05-18 00:41 - 2010-05-18 00:51 +00:00 (GMT)

Unexpected Downtime for APP

Impacted Systems: app, admin

We experienced just under 10 minutes of unexpected downtime due to a database locking issue attributed to user error on our part. The failover system kicked in, redirecting admin to status.foxycart.com and all cart and checkout requests to the temporary store down page. Once the system was back up, DNS entries were refreshed and normal activity resumed.

 

Saturday, May 1st

2010-05-01 11:00pm - 11:59pm -07:00 (PST) / 2010-05-02 06:00 - 2010-05-02 06:59 +00:00 (GMT)

Adding new server to APP environment, Part 3

Impacted Systems: app, admin

Related to Parts 1 and 2, we are scheduling another maintenance window to bring a new physical server into our network. The new physical server will take over as the primary APP server for all FoxyCart stores, and has roughly four times the physical resources of the previous primary APP server.

During this downtime, all *.foxycart.com requests will be routed to failover.foxycart.com and status.foxycart.com. We do, however, recommend all stores on v060 take advantage of the new Content Delivery Network (CDN) functionality, which will yield not only a faster load time but also improved performance during this scheduled maintenance window. More info on that is in our documentation and our forum.

UPDATE 23:22 PDT: Everything is proceeding as planned. IP addresses are being rerouted now.

UPDATE 00:04 PDT: We’re back online, and exploring a potential email routing issue.

UPDATE 00:12 PDT: The email routing has been corrected. (No emails were lost.) Live transactions are coming in. The new server has been added to our environment successfully.

 

Tuesday, Apr 27th

2010-04-27 1:05pm - 1:55pm -07:00 (PST) / 2010-04-27 20:05 - 2010-04-27 20:55 +00:00 (GMT)

Sporadic Momentary Connectivity Losses

Impacted Systems:

We are currently experiencing sporadic bouts of connectivity losses. We are exploring the situation and will update when it is diagnosed and resolved.

14:11 PDT: We’ve made adjustments to our primary APP server’s memory handling, and are in the process of adding more hardware to prevent issues in the future.

 

Monday, Apr 26th

2010-04-26 11:05am - 11:20am -07:00 (PST) / 2010-04-26 18:05 - 2010-04-26 18:20 +00:00 (GMT)

High Latency on APP, Part 2

Impacted Systems: app, admin

We are getting close to pushing out another physical server to handle increased load, but we’re currently experiencing high load and latency. We are working to resolve the issue.

We recommend all FoxyCart stores using v060 move their foxycart_includes.js, foxybox.css, and any other CSS files coming from FoxyCart to use the CDN, which will eliminate any load delays for sites using FoxyCart’s includes. More info on that is in our wiki (with a discussion on our forum).

UPDATE 11:24 PDT: We’re back.

 

Monday, Apr 19th

2010-04-19 2:55pm - 2:00pm -07:00 (PST) / 2010-04-19 21:55 - 2010-04-20 21:00 +00:00 (GMT)

High Latency on APP

Impacted Systems: app, admin

At about 14:55 PDT (21:55 GMT) our primary APP server started experiencing high latency. We are currently working hard to diagnose and correct the issue.

UPDATE 15:26 PDT: We have made some server configuration changes that have dramatically improved response times. We are continuing to work on the situation.

UPDATE 15:40 PDT: We overdid a performance tweak, which caused problems of its own. We are reverting that change and continuing to monitor the situation.

UPDATE 15:46 PDT: We reverted that change and brought things back under control, as well as made another small change to request limits per client.

UPDATE 2010-04-20 09:00 PDT: We are currently experiencing another massive spike in traffic related to issues yesterday that we are attempting to deal with.

UPDATE 09:35 PDT: We have fully recovered at this point. The problem this morning had to do with us making some server changes too aggressively yesterday, which caused problems with memory usage when we were faced with an even greater server load.

We are continuing to work on bringing a new physical server on board to handle massive traffic spikes, and we will update here and on our forum when that happens.

UPDATE 13:47 PDT: We are currently exploring a physical memory upgrade and an emergency maintenance window to prevent further problems while we work on bringing a new physical server into the cluster.

 

Wednesday, Apr 7th

2010-04-07 2:11am - 2:41am -07:00 (PST) / 2010-04-07 09:11 - 2010-04-07 09:41 +00:00 (GMT)

Connectivity Issues, 7 April

Impacted Systems: admin

For approximately 30 minutes early this morning we had some temporary connectivity issues related to a network share on our production server. During that time, all stores were intermittently available while we found and corrected the problem. We’ve taken additional measures to improve the speed and reliability of the network share. Our apologies for any inconvenience that this may have caused.

 

Sunday, Apr 4th

2010-04-04 4:32am - 5:28am -07:00 (PST) / 2010-04-04 11:32 - 2010-04-04 12:28 +00:00 (GMT)

Connectivity Issues, Easter Morning

Impacted Systems: app, admin

This morning between 04:32 and 05:28 PDT (-07:00 GMT) FoxyCart’s primary application server experienced connectivity issues related to disk space on one of our database servers. As is often the case, a series of small and unnoticed issues snowballed into a larger one. In this case, our internal monitoring system did attempt to send a notification of the disk space issue, but the message was temporarily declined by our SMTP server. Unfortunately, this notification wasn’t delivered, and the result was two systems (a database server and a backup partition) going back and forth until the problem affected our primary app server.

No data was lost, and our DNS level failover system responded as expected. We will be exploring ways to increase the reliability and redundancy of our internal systems monitoring application. We sincerely apologize for any inconvenience this has caused.
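As an illustration only (this is not the actual monitoring code), the kind of redundancy described above could be sketched as follows: retry a notification that is temporarily declined, then fall back to a second channel rather than dropping the alert. The `TemporaryFailure` exception, the `deliver_alert` function, and both example channels are hypothetical names introduced for this sketch.

```python
import time
from typing import Callable, Sequence


class TemporaryFailure(Exception):
    """Hypothetical marker for a transient rejection (e.g. a deferred email)."""


def deliver_alert(
    message: str,
    channels: Sequence[Callable[[str], None]],
    retries: int = 3,
    delay_seconds: float = 0.0,
) -> bool:
    """Try each channel in order, retrying transient failures before moving on.

    Returns True as soon as one channel accepts the message, False if every
    channel keeps failing.
    """
    for send in channels:
        for _ in range(retries):
            try:
                send(message)
                return True
            except TemporaryFailure:
                # Wait before the next attempt on the same channel.
                time.sleep(delay_seconds)
    return False


# Hypothetical channels: an email path that is always deferred, and a
# backup path that simply records the message.
def flaky_email(msg: str) -> None:
    raise TemporaryFailure("mail server deferred the message")


received = []
delivered = deliver_alert("disk space low on database server",
                          [flaky_email, received.append])
print(delivered, received)
```

With a single channel and no retry, one temporary decline is enough to lose the alert; with a fallback channel the message still arrives somewhere a person will see it.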

UPDATE 2010-04-08: Some transaction data was in fact lost during this outage for a small number of FoxyCart users. We are following up via email with all impacted users.

 

Saturday, Mar 13th

2010-03-13 11:00pm - 11:59pm -07:00 (PST) / 2010-03-14 07:00 - 2010-03-14 07:59 +00:00 (GMT)

Adding new server to APP environment, Part 2

Impacted Systems: app, admin

We are adding a new physical server to our production hosting environment, which should allow us to continue to grow without load or latency issues. Due to the nature of the changes we’re making we will need to take APP and ADMIN offline briefly to make some networking changes and to bring the new server into the mix. We will be splitting this into two maintenance windows.

We do not expect problems, and if we encounter any we will revert all configuration changes.

UPDATE 2010-03-13 16:00 PST: To add to the above explanation, one of our servers will be getting a new network interface card to replace a faulty card which has caused recent downtime (as noted in other status updates), which requires a shutdown while we make the hardware adjustment.

UPDATE 2010-03-13 23:24 PST: We just switched DNS back to our production environment. We have confirmed that the previous network interface card was indeed the cause of the problems aforementioned. We are running additional diagnostics to ensure things are 100%, but at this point it looks as though the maintenance window is closed.

 

Wednesday, Mar 10th

2010-03-10 4:52pm - 4:53pm -07:00 (PST) / 2010-03-10 23:52 - 2010-03-10 23:53 +00:00 (GMT)

Network Interface Issue on APP

Impacted Systems: app, admin

While setting up the internal networking for a new server, we encountered a bug in the networking driver for the backup interface which had no effect on store connectivity. However, when we attempted to correct the error, a private network interface unexpectedly took the public network address for admin.foxycart.com and *.foxycart.com, which did cause a service interruption. As we were logged into the server at the time, we were able to fix the problem in under a minute and restore connectivity.

Stores using custom subdomains were not affected.

We are working with our hosting provider on diagnosing a potential hardware issue which may have caused this unexpected behavior.

 

Sunday, Mar 7th

2010-03-07 1:15pm - 11:07pm -07:00 (PST) / 2010-03-07 20:15 - 2010-03-08 06:07 +00:00 (GMT)

Unexpected High Latency on APP, Custom Subdomains Non-Responsive

Impacted Systems: app, admin

At 13:15 PST our primary application servers started experiencing unusually high loads. We are currently looking into the cause.

At this point it appears to be entirely unrelated to the issues affecting our AUX systems (www, wiki, forum), which were the result of a DDoS at the datacenter (Slicehost / Rackspace, DFW datacenter). We will update as soon as we have more details.

UPDATE 13:40 PST: We made some system adjustments but are still experiencing issues.

UPDATE 13:51 PST: It looks as though a network drive connection failed, causing highly aberrant behavior throughout the production environment. We’re examining further and will continue to monitor, but at this point it appears that all systems have been restored.

UPDATE 16:01 PST: We will continue to diagnose what went wrong, and we will attempt to reproduce the issue on our dev platform. Explanations we are looking into are:

  • The network drive in question may have come back online with an issue as a result of the maintenance last night.
  • The Slicehost DDoS impacted our monitoring system (located at Slicehost), and when that came back online it somehow impacted our production environment (at a different datacenter), which it monitors.

UPDATE 23:07 PST: While we believed all issues to be resolved as of 13:51 PST, a missing startup script on the application server rebooted earlier has been causing issues for all custom subdomains since then. We will follow up on our forum or by email to impacted accounts.

 

Saturday, Mar 6th

2010-03-06 11:00pm - 11:59pm -07:00 (PST) / 2010-03-07 06:00 - 2010-03-07 06:59 +00:00 (GMT)

Adding new server to APP environment, Part 1

Impacted Systems: app, admin

We are adding a new physical server to our production hosting environment, which should allow us to continue to grow without load or latency issues. Due to the nature of the changes we’re making we will need to take APP and ADMIN offline briefly to make some networking changes and to bring the new server into the mix. We will be splitting this into two maintenance windows.

We do not expect problems, and if we encounter any we will revert all configuration changes.

 

Thursday, Mar 4th

2010-03-04 3:26pm - 11:26pm -07:00 (PST) / 2010-03-04 22:26 - 2010-03-05 06:26 +00:00 (GMT)

DNS Failover Issues

Impacted Systems: app

At 3:26pm PST today FoxyCart’s APP environment’s webserver experienced a momentary lapse in service. Though the window was only about 30 seconds, it triggered our DNS-level failover monitoring, which redirected all traffic for *.foxycart.com to our failover system.

Unfortunately, our DNS failover was misconfigured for *.foxycart.com, and did not revert back to the primary IP address immediately. Instead, this failover A record persisted for almost an hour. Further, the TTL on the failover A record was set to 28800 instead of the 300 it should have been.

The result was that visitors querying a *.foxycart.com domain for the first time in the previous 8 hours, during the one hour window after the initial failover, would have received an incorrect A record with an 8 hour TTL. These users, unfortunately, will not be able to proceed through to a cart or checkout, but should instead get a notice that the e-commerce functionality is currently undergoing maintenance. Depending on the store settings, HTTPS requests may fail with an invalid certificate warning, which is the result of a misconfiguration on the failover system. Unfortunately, due to the cached IP address, this error is not something that can be resolved for users already affected.
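To make the TTL arithmetic concrete, here is a minimal sketch of how long a resolver may keep serving a cached record under each TTL. The two TTL values come from the description above; the `cache_expiry` helper and the 16:00 fetch time are hypothetical, chosen only as an example inside the roughly one-hour failover window.

```python
from datetime import datetime, timedelta

# Values taken from the incident description above.
MISCONFIGURED_TTL = 28800   # seconds: TTL that was set on the failover A record
INTENDED_TTL = 300          # seconds: TTL the record should have had


def cache_expiry(resolved_at: datetime, ttl_seconds: int) -> datetime:
    """Latest time a resolver may keep serving a record fetched at resolved_at."""
    return resolved_at + timedelta(seconds=ttl_seconds)


# Hypothetical example: a resolver that fetched the failover record at
# 16:00 local time on the day of the incident.
fetched = datetime(2010, 3, 4, 16, 0)

print(cache_expiry(fetched, MISCONFIGURED_TTL))  # 2010-03-05 00:00:00
print(cache_expiry(fetched, INTENDED_TTL))       # 2010-03-04 16:05:00
```

The 28800-second TTL works out to 8 hours, versus 5 minutes for the intended 300-second value, which is why affected visitors could keep reaching the failover system long after the primary address was restored.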

Stores using custom subdomains were not affected.

The situation will resolve itself automatically for all affected users as soon as the DNS cache clears. Unfortunately, due to the nature of DNS, there is nothing we can do to speed up the process at this point.

If you have been affected and need access immediately we suggest using Google Public DNS or OpenDNS.

We are painfully aware of the potential lost sales and frustration caused by this chain of events, and are examining our failover systems to ensure this does not happen again. We are also exploring adding a “soft” failover in addition to the current DNS failover functionality. We sincerely regret the inconvenience. Please let us know in our forums if you have questions or concerns.

Updated: 17:53 PST. Added explanation of SSL issues.

 

Monday, Mar 1st

2010-03-01 2:03pm - 3:04pm -07:00 (PST) / 2010-03-01 21:03 - 2010-03-01 22:04 +00:00 (GMT)

Sporadic connectivity issues

Impacted Systems: app, admin

We are currently experiencing brief and sporadic connectivity issues due to unexpected load spikes. These connectivity lapses are ranging from a few seconds to about a minute in length.

We’re working on this and will update here as we are able.

Update, 15:02 PST: We have disabled some extended logging on the primary application server that has greatly reduced the system load. Further, we have another physical server that will be added to our systems soon, which will reduce the likelihood of load-related issues at this point.

 

Thursday, Jan 21st

2010-01-21 2:44pm - 9:11am -07:00 (PST) / 2010-01-21 21:44 - 2010-01-22 16:11 +00:00 (GMT)

Unexpected High Latency on APP

Impacted Systems: app, admin

At 14:44:20 PST FoxyCart’s APP server started experiencing high latency. We are all-hands-on-deck working on this issue. We’ll post updates here and on Twitter, as soon as we can provide more information.

15:26 PST: We’re still experiencing very high loads on our production environment, but changes just made have improved response times. We’re still working on it and will continue to update.

15:29 PST: Rebooting one of the production web server VMs quickly to allocate more processor time.

15:35 PST: While we’re still experiencing very high load, our APP servers shouldn’t be causing catastrophic delays when serving pages at this point.

15:53 PST: We’ve made additional improvements to our APP environment, but a FoxyCart user is currently getting a very high amount of traffic that we are shuffling resources to deal with.

16:22 PST: Additional VM configuration changes required another brief period of downtime to take effect.

16:33 PST: Making another configuration change that should allow us to handle the (radically) increased requests, as well as a caching change to prevent some trips to the database.

17:34 PST: Our production environment is currently stable, but slow. We are continuing to explore options.

17:45 PST: A forum thread has been started. Additional explanation is there. If you have questions, concerns, or comments please post there.

20:45 PST: Systems have been regaining speed for hours at this point, and are back to normal. We will provide additional updates on our forum, including steps we will take to ensure a traffic spike of this magnitude does not impact our systems like this again.

02:11 PST, 2010.01.22: The primary application server experienced database connectivity issues related to events from earlier today.

06:55 PST: High latency has returned, and we are continuing to make adjustments as we wait for additional hardware to be available. We will continue to update as we can.

09:38 PST: Load has decreased. Services have returned to normal and additional slowdowns are not expected at this point. Additional hardware is on the way. Please check our forum (http://forum.foxycart.com/comments.php?DiscussionID=2483) for more information.

 

Friday, Dec 4th

2009-12-04 12:35pm to 12:55pm -07:00 (PST); 2009-12-04 19:35 to 2009-12-04 19:55 +00:00 (GMT)

Unexpected Downtime for www, forum, wiki

Impacted Systems: www, forum, wiki, affiliate

www.foxycart.com, forum.foxycart.com, and wiki.foxycart.com are all experiencing downtime due to an unexpected server issue at Slicehost, the datacenter where these sites are located. Our sites should come back up as soon as Slicehost addresses the issue. We will update here as necessary.

Note that this does not affect the FoxyCart system itself. Stores are still live, and the admin is functional as well.

12:55 PST: Back up. Slicehost performed an emergency reboot of the host server. While our slice did reboot, the database server did not come back up cleanly and required a clean restart on our end. We apologize for the inconvenience. Again, this outage did not affect service to the FoxyCart system itself.

 

Monday, Nov 2nd

2009-11-02 10:40pm to 12:07am -07:00 (PST); 2009-11-03 05:40 to 2009-11-03 07:07 +00:00 (GMT)

Unexpected Downtime for www, forum, wiki

Impacted Systems: www, forum, wiki, affiliate

www.foxycart.com, forum.foxycart.com, and wiki.foxycart.com are all experiencing downtime due to an unexpected outage at Slicehost, the datacenter where these sites are located. Our sites should come back up as soon as Slicehost addresses the issue. We will update here as necessary.

Note that this does not affect the FoxyCart system itself. Stores are still live, and the admin is functional as well.

00:07 PST: Back up.

 

Friday, Sep 11th

2009-09-11 10:00pm to 12:00am -07:00 (PST); 2009-09-12 05:00 to 2009-09-12 07:00 +00:00 (GMT)

Server Migration

Impacted Systems: app, admin

In order to continue to provide reliable e-commerce service to our users, we are upgrading our main application servers. The new servers are faster, helping us to handle future growth while providing increased security, redundancy and performance. We are also switching datacenters in order to improve response times to Europe. In addition, we have expanded our security audits and ensured that your data will remain as strongly protected as ever on our new servers.

We are excited about this move and want to assure you this has been planned in great detail to ensure a successful transition.

Impact on Connectivity

Because this migration involves two separate datacenters, it will involve downtime. We have scheduled 2 full hours for this transition:

  • US, PST: 2009-09-11 10:00pm-12:00am PST
  • US, EST: 2009-09-12 01:00am-03:00am EST
  • International, GMT: 2009-09-12 05:00-07:00 GMT
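The three listings above describe the same two-hour window. As a rough cross-check, the sketch below converts the Pacific start time into the other zones using fixed UTC offsets. The -07:00 Pacific offset is the one shown in this page's timestamps; the -04:00 Eastern offset is an assumption inferred from the listed times (01:00am Eastern matching 05:00 GMT). The helper name `window_in` is purely illustrative.

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC offsets. -07:00 is the Pacific offset shown in this page's
# timestamps; -04:00 for Eastern is an assumption inferred from the
# listed window (01:00am Eastern = 05:00 GMT).
PACIFIC = timezone(timedelta(hours=-7), "Pacific")
EASTERN = timezone(timedelta(hours=-4), "Eastern")
GMT = timezone.utc


def window_in(zone, start, end):
    """Return the (start, end) pair expressed in the given zone."""
    return start.astimezone(zone), end.astimezone(zone)


# Maintenance window as announced: 10:00pm Pacific, two hours long.
start = datetime(2009, 9, 11, 22, 0, tzinfo=PACIFIC)
end = start + timedelta(hours=2)

for label, zone in (("Pacific", PACIFIC), ("Eastern", EASTERN), ("GMT", GMT)):
    s, e = window_in(zone, start, end)
    print(f"{label}: {s:%Y-%m-%d %H:%M} to {e:%Y-%m-%d %H:%M}")
```

Fixed offsets keep the sketch free of any time zone database dependency; a real scheduling tool would more likely use named zones so that daylight saving changes are handled automatically.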

During this period you should plan for any FoxyCart requests to be non-functional. How requests will be handled is outlined on the new status.foxycart.com.

Planning

Strategically planning for this migration has included several dry runs and testing to ensure the highest level of success possible. Although there is very little room for human error and we do not anticipate any problems at this point, the entire FoxyCart team will be monitoring this procedure to immediately resolve any issues should they arise.

Recovery Plan

Should problems occur, we will revert DNS entries back to our current servers. We have already shortened the TTL (Time-to-Live) values for all of our affected DNS records so that our DNS changes will take effect as quickly as possible. We don’t anticipate needing to switch back, but this is our “just in case” fallback position in order to restore service by the end of the maintenance window.
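To illustrate why a shorter TTL matters for this fallback, here is a minimal sketch of the arithmetic. A resolver that cached the old record just before a change may keep serving it for up to one TTL. The TTL values and the rollback time below are hypothetical (the actual values are not stated here); only the 07:00 GMT end of the window comes from the schedule above. The function name `latest_cache_expiry` is illustrative.

```python
from datetime import datetime, timedelta, timezone


def latest_cache_expiry(change_time, ttl_seconds):
    """Latest moment a resolver that cached the old record just before
    the change could still be serving it (assumes resolvers honor TTL)."""
    return change_time + timedelta(seconds=ttl_seconds)


# Hypothetical rollback time, one hour into the window.
change = datetime(2009, 9, 12, 6, 0, tzinfo=timezone.utc)
# End of the maintenance window, per the announced schedule.
window_end = datetime(2009, 9, 12, 7, 0, tzinfo=timezone.utc)

for ttl in (300, 86400):  # hypothetical TTLs: 5 minutes vs. 1 day
    expiry = latest_cache_expiry(change, ttl)
    print(f"TTL {ttl}s: caches clear by {expiry:%Y-%m-%d %H:%M} GMT, "
          f"inside window: {expiry <= window_end}")
```

Under these assumed numbers, a 5-minute TTL lets a rollback settle well inside the window, while a 1-day TTL would not. Some resolvers ignore TTLs, so this is an upper bound only for well-behaved caches.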

Links of interest:

UPDATES:

02:14 EDT: Most of the migration is complete, and testing is underway. Two issues:

  1. Some custom subdomains were not transferred over correctly.
  2. Custom subdomains are currently experiencing issues with the firewall on the new environment. They are not currently using the new failover functionality, and may be timing out. We are working on a fix and will update shortly.

02:22 EDT: Determined that the custom subdomain issue isn’t a firewall problem, but a routing issue within the new server cluster.

02:28 EDT: Fixed the custom subdomains. The issue was a script that was not run when the main network interface was turned on.

02:36 EDT: Discovered an issue with email receipts from stores using custom subdomains. Resolved.

03:00 EDT: Turned on public access to the new systems, right on schedule. If you notice any issues please let us know in the forum.