Uploading files in rails is a relatively easy task. There are a lot of helpers that make it even more flexible, such as attachment_fu or paperclip. But what happens if you upload VERY VERY LARGE files (say 5GB) in rails, do the standard solutions apply? The main thing is that we want to avoid load-the-file-in-memory strategies and avoid multiple temporary file writes.
This document describes our findings on uploading this kind of file in a rails environment. We tried the following alternatives:
- Using Webrick
- Using Mongrel
- Using Merb
- Using Mongrel Handlers
- Using Sinatra
- Using Rack Metal
- Using Mod_Rails aka Passenger
- Non-Rails Alternatives
And I'm afraid the news is not that good. For now...
A simple basic Upload Handler (to get started)
Ok, let's make a little upload application (loosely based upon http://www.tutorialspoint.com/ruby-on-rails/rails-file-uploading.htm).
Install rails (just to show you the version I used)
$ gem install rails
Successfully installed rake-0.8.4
Successfully installed activesupport-2.3.2
Successfully installed activerecord-2.3.2
Successfully installed actionpack-2.3.2
Successfully installed actionmailer-2.3.2
Successfully installed activeresource-2.3.2
Successfully installed rails-2.3.2
The first step is to create a controller with two actions: 'index' shows the form 'uploadfile.html.erb' and 'upload' handles the upload.
$ gem install sqlite3-ruby
$ rails upload-test
$ cd upload-test
$ script/generate controller Upload
      exists  app/controllers/
      exists  app/helpers/
      create  app/views/upload
      exists  test/functional/
      create  test/unit/helpers/
      create  app/controllers/upload_controller.rb
      create  test/functional/upload_controller_test.rb
      create  app/helpers/upload_helper.rb
      create  test/unit/helpers/upload_helper_test.rb
#app/controller/upload_controller.rb
class UploadController < ApplicationController
def index
render :file => 'app/views/upload/uploadfile.html.erb'
end
def upload
post = Datafile.save(params[:uploadform])
render :text => "File has been uploaded successfully"
end
end
The second step is to create the view with the file upload form for the browser. Note the multipart option, which makes the form do a multipart POST.
#app/views/upload/uploadfile.html.erb
<% form_for :uploadform, :url => { :action => 'upload'}, :html => {:multipart => true} do |f| %>
  <%= f.file_field :datafile %><br />
  <%= f.submit 'Create' %>
<% end %>
Last is to create the model, to save the uploaded file to public/data. Note the original_filename we use to name the file on disk.
#app/models/datafile.rb
class Datafile < ActiveRecord::Base
  def self.save(upload)
    name = upload['datafile'].original_filename
    directory = "public/data"
    # create the file path
    path = File.join(directory, name)
    # write the file
    File.open(path, "wb") { |f| f.write(upload['datafile'].read) }
  end
end
Before we start up, we create the public/data dir:
$ mkdir public/data
$ ./script/server webrick
=> Booting WEBrick
=> Rails 2.3.2 application starting on http://0.0.0.0:3000
=> Call with -d to detach
=> Ctrl-C to shutdown server
[2009-04-10 13:18:27] INFO  WEBrick 1.3.1
[2009-04-10 13:18:27] INFO  ruby 1.8.6 (2008-03-03) [universal-darwin9.0]
[2009-04-10 13:18:27] INFO  WEBrick::HTTPServer#start: pid=5057 port=3000
Point your browser to http://localhost:3000/upload and you can upload a file. If all goes well, there should be a file in public/data with the same name as the file you uploaded.
Scripting a large Upload
Browsers have their limitations for file uploads. Depending on whether you're running a 64-bit OS and a 64-bit browser, you can upload larger files, but 2GB seems to be the limit.
For scripting the upload we will use curl to do the same thing. To upload a file called large.zip to our form, you can use:
curl -Fuploadform['datafile']=@large.zip http://localhost:3000/upload/upload
If you use this, rails throws the following error: "ActionController::InvalidAuthenticityToken (ActionController::InvalidAuthenticityToken):"
As described in http://ryandaigle.com/articles/2007/9/24/what-s-new-in-edge-rails-better-cross-site-request-forging-prevention, it is used to protect rails against cross-site request forgery. We need to have rails skip this filter.
#app/controller/upload_controller.rb
class UploadController < ApplicationController
  skip_before_filter :verify_authenticity_token
Webrick and Large File Uploads
Webrick is the default webserver that ships with rails. Now let's upload a large file and see what happens.
Ok, it's natural that this takes longer to handle. But zoom in on the memory usage of your ruby process, for instance with top:
7895 ruby 16.0% 0:26.61 2 33 144 559M 188K 561M 594M
====> Memory GROWS: We see that the ruby process is growing and growing. I guess it is because webrick loads the body into a string first.
#gems/rails-2.3.2/lib/webrick_server.rb
def handle_dispatch(req, res, origin = nil) #:nodoc:
  data = StringIO.new
  Dispatcher.dispatch(
    CGI.new("query", create_env_table(req, origin), StringIO.new(req.body || "")),
    ActionController::CgiRequest::DEFAULT_SESSION_OPTIONS,
    data
  )
=====> Files get written to disk multiple times for the multipart parsing: When the file is uploaded, you see a message appearing in the webrick log. It has a file in /var/folder/EI/...
Processing UploadController#upload (for ::1 at 2009-04-09 13:51:23) [POST]
  Parameters: {"commit"=>"Create", "authenticity_token"=>"rf4V5bmHpxG74q6ueI3hUjJzwhTLUJCp9VO1uMV1Rd4=", "uploadform"=>{"datafile"=>#<File:/var/folders/EI/EIPLmNwOEea96YJDLHTrhU+++TI/-Tmp-/RackMultipart.7895.1>}}
[2009-04-09 14:09:03] INFO  WEBrick::HTTPServer#start: pid=7974 port=3000
It turns out that the part that handles the multipart writes the files to disk in the $TMPDIR. It creates files like:
$ ls $TMPDIR/
RackMultipart.7974.0 RackMultipart.7974.1
Strange, two times? We only uploaded one file! I figure this is handled by the rack/utils.rb bundled in action_controller. Possibly related is the bug described at https://rails.lighthouseapp.com/projects/8994/tickets/1904-rack-middleware-parse-request-parameters-twice
#gems/actionpack-2.3.2/lib/action_controller/vendor/rack-1.0/rack/utils.rb
# Stolen from Mongrel, with some small modifications:
def self.parse_multipart(env)
Optimizing the last write to disk
Instead of
# write the file
File.open(path, "wb") { |f| f.write(upload['datafile'].read) }
we can use the following to avoid writing to disk ourselves:
FileUtils.mv upload['datafile'].path, path
This makes use of the fact that the file is already on disk, and a file move is much faster than rewriting the file.
Still, this might not be usable in all cases: if your TMPDIR is on another filesystem than your final destination, this trick won't help you.
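If the move is impossible, we can at least avoid the read-everything-into-memory write by copying in small chunks. A minimal, untested sketch (store_upload and the 64KB chunk size are my own inventions, not part of the app above):

# Try a cheap rename first; fall back to a chunked copy so the
# whole file never has to fit in memory.
def store_upload(tempfile_path, destination)
  File.rename(tempfile_path, destination)
rescue Errno::EXDEV
  # TMPDIR is on another filesystem: stream the bytes over in 64KB chunks
  File.open(tempfile_path, "rb") do |src|
    File.open(destination, "wb") do |dst|
      while chunk = src.read(64 * 1024)
        dst.write(chunk)
      end
    end
  end
  File.delete(tempfile_path)
end

Ruby 1.8 has no IO.copy_stream yet, hence the manual loop.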
Mongrel and Large File Uploads
The behaviour of Webrick was already discussed on the mongrel mailinglist http://osdir.com/ml/lang.ruby.mongrel.general/2007-10/msg00096.html and is supposed to be fixed there. So let's install mongrel:
$ gem install mongrel
Successfully installed gem_plugin-0.2.3
Successfully installed daemons-1.0.10
Successfully installed fastthread-1.0.7
Successfully installed cgi_multipart_eof_fix-2.5.0
Successfully installed mongrel-1.1.5
$ mongrel_rails start
Ok, let's start the upload again using our curl:
======> Memory does not grow: that's good news.
======> 4 file writes for 1 upload! Because Mongrel does not keep the upload in memory, it writes it to a tempfile in the $TMPDIR. Depending on the size of the request body (larger than MAX_BODY or not), it will create a tempfile or just a string in memory.
lib/mongrel/const.rb
# This is the maximum header that is allowed before a client is booted. The parser detects
# this, but we'd also like to do this as well.
MAX_HEADER=1024 * (80 + 32)
In our tests, we saw that aside from the RackMultipart.<pid>.x files, there is an additional file written in $TMPDIR: mongrel.<pid>.0
# Maximum request body size before it is moved out of memory and into a tempfile for reading.
MAX_BODY=MAX_HEADER
lib/mongrel/http_request.rb
# must read more data to complete body
if remain > Const::MAX_BODY
  # huge body, put it in a tempfile
  @body = Tempfile.new(Const::MONGREL_TMP_BASE)
  @body.binmode
else
  # small body, just use that
  @body = StringIO.new
end
That means that for a 5GB upload we need 4x 5GB of disk space: 1 mongrel tempfile + 2 RackMultipart files + 1 final file (depending on the move trick or not) = 20GB.
======> Not reliable, unpredictable results?
Also, during some uploads mongrel did not create the RackMultipart files but CGI.<pid>.0 instead. Unsure what the reason is.
Merb and Large File Uploads
One of the solutions you see suggested for handling file uploads is using Merb, the main reason being that there is less blocking of your handlers.
- http://www.idle-hacking.com/2007/09/scalable-file-uploads-with-merb/
- http://devblog.rorcraft.com/2008/8/25/uploading-large-files-to-rails-with-merb
- http://blog.vixiom.com/2007/06/29/merb-on-air-drag-and-drop-multiple-file-upload/
$ gem install merb
Successfully installed dm-aggregates-0.9.11
Successfully installed dm-validations-0.9.11
Successfully installed randexp-0.1.4
Successfully installed dm-sweatshop-0.9.11
Successfully installed dm-serializer-0.9.11
Successfully installed merb-1.0.11
Let's create the merb application:
$ merb-gen app uploader-app
$ cd uploader-app
We need to create the controller, but this is a bit different from our original controller:
- the file is called upload.rb instead of upload_controller.rb
- removed the skip_before_filter
- in Merb it is Application and not ApplicationController
#app/controllers/upload.rb
class Upload < Application
  def index
    render :file => 'app/views/upload/uploadfile.rhtml'
  end

  def upload
    post = Datafile.save(params[:uploadform])
    render :text => "File has been uploaded successfully"
  end
end
The model looks like this:
- Remove the ActiveRecord
- include DataMapper::Resource
- original_filename does not exist: merb passes it in the variable filename
- tempfile: the way merb passes the temporary file also changed
#app/models/datafile.rb
class Datafile
  include DataMapper::Resource

  def self.save(upload)
    name = upload['datafile']['filename']
    directory = "public/data"
    # create the file path
    path = File.join(directory, name)
    # write the file
    File.open(path, "wb") { |f| f.write(upload['datafile']['tempfile'].read) }
  end
end
We create the public/data dir:
$ mkdir public/data
And start merb.
$ merb
 ~ Connecting to database...
 ~ Loaded slice 'MerbAuthSlicePassword' ...
 ~ Parent pid: 57318
 ~ Compiling routes...
 ~ Activating slice 'MerbAuthSlicePassword' ...
merb : worker (port 4000) ~ Starting Mongrel at port 4000
When you start the upload, a merb worker becomes active.
=====> No memory increases : good!
merb : worker (port 4000) ~ Successfully bound to port 4000
=====> 3 file writes: 1 mongrel + 1 merb + 1 final write
Mongrel first starts writing its mongrel.<pid>.0 in our $TMPDIR/.
merb : worker (port 4000) ~ Params: {"format"=>nil, "action"=>"upload", "id"=>nil, "controller"=>"upload", "uploadform"=>{"datafile"=>{"content_type"=>"application/octet-stream", "size"=>306609434, "tempfile"=>#<File:/var/folders/EI/EIPLmNwOEea96YJDLHTrhU+++TI/-Tmp-/Merb.13243.0>, "filename"=>"large.zip"}}}
After that, Merb handles the multipart stream and writes once in $TMPDIR/Merb.<pid>.0
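As in the Rails version, we could presumably avoid the final read/write by moving Merb's tempfile instead of reading it (untested, and the same cross-filesystem caveat applies):

# in Datafile.save, instead of the File.open ... write
FileUtils.mv upload['datafile']['tempfile'].path, path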
Sinatra and Large Files
Sinatra is a simple framework where you describe the controllers yourself. Because it seemed to have direct access to the stream, I hoped that I would be able to stream the upload directly without the multipart handling of Rack.
- http://technotales.wordpress.com/2008/03/05/sinatra-the-simplest-thing-that-could-possibly-work/
- http://m.onkey.org/2008/11/10/rails-meets-sinatra
- http://www.slideshare.net/jiang.wu/ruby-off-rails
- http://sinatra-book.gittr.com/
$ gem install sinatra
Successfully installed sinatra-0.9.1.1
1 gem installed
Installing ri documentation for sinatra-0.9.1.1...
Installing RDoc documentation for sinatra-0.9.1.1...
Create a sample upload handler:
#sinatra-test-upload.rb
require 'rubygems'
require 'sinatra'

post '/upload' do
  File.open("/tmp/theuploadedfile","wb") { |f| f.write(params[:datafile]['file'].read) }
end
$ ruby upload-sinatra.rb
== Sinatra/0.9.1.1 has taken the stage on 4567 for development with backup from Mongrel
====> No memory increase: good!
Note that instead of 3000 it listens on port 4567.
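The form field name also differs from the Rails example; assuming the same large.zip, a matching curl invocation would be something like:

$ curl -F 'datafile[file]=@large.zip' http://localhost:4567/upload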
====> 4 file writes: again we see 4 = 1 mongrel.<pid>.* + 2 x RackMultipart.<pid>.* + 1 final write
Using Mongrel handlers to bypass other handlers
Up until now, we have three kinds of writes: the webserver's, the multipart parser's, and the final write. So how can we skip the webserver or the multipart writing to disk without consuming all the memory?
I found another approach by using a standalone mongrel handler:
- http://rubyenrails.nl/articles/2007/12/24/rails-mvc-aan-je-laars-lappen-met-mongrel-handlers
- http://www.ruby-forum.com/topic/128070
Let's create an example Mongrel Handler. It's just the part that shows you that you can access the request directly:
require 'rubygems'
require 'mongrel'

class HelloWorldHandler < Mongrel::HttpHandler
  def process(request, response)
    puts request.body.path
    response.start(200) do |head,out|
      head['Content-Type'] = "text/plain"
      out << "Hello world!"
    end
  end

  def request_progress(params, clen, total)
  end
end

Mongrel::Configurator.new do
  listener :port => 3000 do
    uri "/", :handler => HelloWorldHandler.new
  end
  run; join
end
=====> No memory increase: good!
=====>1 FILE and direct access, but still needs multipart parsing:
It turns out that request.body.path is the mongrel.<pid>.0 file, giving us direct access to the uploaded file.
request.body.path = /var/folders/EI/EIPLmNwOEea96YJDLHTrhU+++TI/-Tmp-/mongrel.93690.0
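Since that tempfile is already on disk, a handler could in theory claim it with a move. Keep in mind it is the raw request body, multipart boundaries included, so it still needs parsing before the payload is usable. A rough, untested sketch (the destination path is made up):

require 'rubygems'
require 'mongrel'
require 'fileutils'

class RawUploadHandler < Mongrel::HttpHandler
  def process(request, response)
    # Large bodies arrive as a Tempfile (mongrel.<pid>.N), small ones as a StringIO
    if request.body.respond_to?(:path)
      # claim the raw multipart body before Mongrel cleans it up
      FileUtils.mv(request.body.path, "/data/incoming/raw-#{Time.now.to_i}.multipart")
    end
    response.start(200) do |head,out|
      head['Content-Type'] = "text/plain"
      out << "stored raw body for later parsing"
    end
  end
end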
Using Rails Metal
Metal is an addition to Rails 2.3 that lets you sit directly on Rack, bypassing most of the Rails stack.
- http://soylentfoo.jnewland.com/articles/2008/12/16/rails-metal-a-micro-framework-with-the-power-of-rails-m
- http://railscasts.com/episodes/150-rails-metal
- http://www.pathf.com/blogs/2009/03/uploading-files-to-rails-metal/
- http://www.ruby-forum.com/topic/171070
# Allow the metal piece to run in isolation
require(File.dirname(__FILE__) + "/../../config/environment") unless defined?(Rails)

class Uploader
  def self.call(env)
    if env["PATH_INFO"] =~ /^\/uploader/
      puts env["rack.input"].path
      [200, {"Content-Type" => "text/html"}, ["It worked"]]
    else
      [400, {"Content-Type" => "text/html"}, ["Error"]]
    end
  end
end
Similar to the Mongrel HTTP handler, we get access to the mongrel upload file via:
env["rack.input"].path = actually the /var/folders/EI/EIPLmNwOEea96YJDLHTrhU+++TI/-Tmp-/mongrel.81685.0If we want to parse this, we can pass the env to the Request.new but this kicks in the RackMultipart again.
request = Rack::Request.new(env)
puts request.POST
#uploaded_file = request.POST["file"][:tempfile].read
=====> No memory increase: good!
=====>1 FILE and direct access, but still needs multipart parsing
=====>Can still run traditional rails and metal rails in the same webserver
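To avoid kicking in the RackMultipart writing, a metal could read env["rack.input"] itself: move the file when the input is already a tempfile (as with Mongrel), otherwise stream it to disk in chunks. Again this only stores the raw multipart body, and the paths here are hypothetical:

#app/metal/streaming_uploader.rb
require(File.dirname(__FILE__) + "/../../config/environment") unless defined?(Rails)
require 'fileutils'

class StreamingUploader
  DESTINATION = "/data/incoming/raw.multipart" # hypothetical

  def self.call(env)
    if env["PATH_INFO"] =~ /^\/uploader/
      input = env["rack.input"]
      if input.respond_to?(:path)
        # Mongrel already buffered the body to disk: just move it
        FileUtils.mv(input.path, DESTINATION)
      else
        # otherwise copy in 64KB chunks, never buffering the whole body
        File.open(DESTINATION, "wb") do |out|
          while chunk = input.read(64 * 1024)
            out.write(chunk)
          end
        end
      end
      [200, {"Content-Type" => "text/html"}, ["stored raw multipart body"]]
    else
      # a 404 lets the request fall through to the regular Rails app
      [404, {"Content-Type" => "text/html"}, ["Not Found"]]
    end
  end
end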
Using Mod_rails aka Passenger
Mod_rails seems to be becoming the new standard for running rails applications without the blocking hassle, on plain Apache as good, stable, proven technology.
One of the main benefits is that a handler isn't blocked for the whole time the request is being received. Sounds like good technology here!
curl -v -F datafile['file']=@large.zip http://localhost:80/
* About to connect() to localhost port 80
*   Trying 127.0.0.1... connected
* Connected to localhost (127.0.0.1) port 80
> POST /datafiles HTTP/1.1
> User-Agent: curl/7.15.5 (x86_64-redhat-linux-gnu) libcurl/7.15.5 OpenSSL/0.9.8b zlib/1.2.3 libidn/0.6.5
> Host: localhost
> Accept: */*
> Content-Length: 421331151
> Expect: 100-continue
> Content-Type: multipart/form-data; boundary=----------------------------1bf75aea2f35
>
< HTTP/1.1 100 Continue
Setting up mod_rails is beyond the scope of this document, so we assume you have it working for your rails app.
In my /etc/httpd/conf/httpd.conf:
LoadModule passenger_module /opt/ruby-enterprise-1.8.6-20090201/lib/ruby/gems/1.8/gems/passenger-2.1.3/ext/apache2/mod_passenger.so
PassengerRoot /opt/ruby-enterprise-1.8.6-20090201/lib/ruby/gems/1.8/gems/passenger-2.1.3
PassengerRuby /opt/ruby-enterprise-1.8.6-20090201/bin/ruby
Mod_rails has a nice setting, PassengerTempDir, to specify which tmpdir it uses:
See http://www.modrails.com/documentation/Users%20guide.html#_passengertempdir_lt_directory_gt for more details
5.10. PassengerTempDir <directory>
Specifies the directory that Phusion Passenger should use for storing temporary files. This includes things such as Unix socket files, buffered file uploads, etc.
This option may be specified once, in the global server configuration. The default temp directory that Phusion Passenger uses is /tmp.
This option is especially useful if Apache is not allowed to write to /tmp (which is the case on some systems with strict SELinux policies) or if the partition that /tmp lives on doesn’t have enough disk space.
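For example, to buffer uploads on a bigger partition, something like this in the global Apache config should do (the path is of course made up):

PassengerTempDir /mnt/bigdisk/passenger-tmp

Ok, let's start the upload and see what happens: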
=====> Memory goes up!
# ./passenger-memory-stats
-------------- Apache processes ---------------
PID    PPID   Threads  VMSize    Private  Name
-----------------------------------------------
30840  1      1        184.3 MB  0.0 MB   /usr/sbin/httpd
30852  30840  1        186.2 MB  ?        /usr/sbin/httpd
30853  30840  1        184.3 MB  ?        /usr/sbin/httpd
30854  30840  1        184.3 MB  ?        /usr/sbin/httpd
30855  30840  1        184.3 MB  ?        /usr/sbin/httpd
30856  30840  1        184.3 MB  ?        /usr/sbin/httpd
30857  30840  1        184.3 MB  ?        /usr/sbin/httpd
30858  30840  1        184.3 MB  ?        /usr/sbin/httpd
30859  30840  1        184.3 MB  ?        /usr/sbin/httpd
### Processes: 9
### Total private dirty RSS: 0.03 MB (?)

---------- Passenger processes -----------
PID    Threads  VMSize     Private   Name
-------------------------------------------
30847  4        14.1 MB    0.1 MB    /opt/ruby-enterprise-1.8.6-20090201/lib/ruby/gems/1.8/gems/passenger-2.1.3/ext/apache2/ApplicationPoolServerExecutable 0 /opt/ruby-enterprise-1.8.6-20090201/lib/ruby/gems/1.8/gems/passenger-2.1.3/bin/passenger-spawn-server /opt/ruby-enterprise-1.8.6-20090201/bin/ruby /tmp/passenger.30840/info/status.fifo
30848  1        87.7 MB    ?         Passenger spawn server
30888  1        123.6 MB   0.0 MB    Passenger ApplicationSpawner: /home/myrailsapp
30892  1        1777.4 MB  847.5 MB  Rails: /home/myrailsapp
### Processes: 4
### Total private dirty RSS: 847.62 MB (?)
Very strange! In /opt/ruby-enterprise-1.8.6-20090201/lib/ruby/gems/1.8/gems/passenger-2.1.3/ext/apache2/Hooks.cpp of the passenger source:
expectingUploadData = ap_should_client_block(r);
if (expectingUploadData && atol(lookupHeader(r, "Content-Length")) > UPLOAD_ACCELERATION_THRESHOLD) {
    uploadData = receiveRequestBody(r);
}
the expectingUploadData part is the one that responds to the
> Expect: 100-continue
header. But it seems curl isn't handling this exchange well: it keeps on streaming the file, ignoring the response.
To avoid this, we can fall back to HTTP/1.0 (which has no Expect/Continue mechanism) by passing -0 to curl.
$ curl -v -0 -F datafile['file']=@large.zip http://localhost:80
* About to connect() to localhost port 80
*   Trying 127.0.0.1... connected
* Connected to localhost (127.0.0.1) port 80
> POST /uploader/ HTTP/1.0
> User-Agent: curl/7.15.5 (x86_64-redhat-linux-gnu) libcurl/7.15.5 OpenSSL/0.9.8b zlib/1.2.3 libidn/0.6.5
> Host: localhost
> Accept: */*
> Content-Length: 421331151
> Content-Type: multipart/form-data; boundary=----------------------------1b04b7cb6566
Now the correct mechanism happens.
/tmp/passenger.1291/backends/backend.g0mi40ARBFbEdb08pxB3uzyh3JJyfR1eaI9xPuQwyLEd3NjQ24rbpSBb9FrZfNX5WI5VYQ
====> Memory doesn't go up: good! (again)
====> Same number of file writes = 1 in /tmp/passenger.* + the same ones as in the previous examples
The alternatives: (non-rails)
The problem so far is mainly one of implementation; there is no reason why streaming a file upload would not be possible in rails.
The correct hooks for streaming the file directly to a handler, without temporary files or memory buffering, are currently just not there.
I hope eventually we will see an Upload streaming API (similar to the download Stream API) and a streamable Multipart handler.
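To make the wish concrete, such an API could look something like this. This is pure fantasy, nothing like it exists in Rails today:

# HYPOTHETICAL streaming multipart API -- does not exist in Rails
class UploadController < ApplicationController
  def upload
    # each part is parsed straight off the socket and handed to us
    # in chunks, without a tempfile or an in-memory string in between
    request.each_multipart_part do |part|
      File.open(File.join("public/data", part.filename), "wb") do |f|
        part.each_chunk { |chunk| f.write(chunk) }
      end
    end
    render :text => "File has been uploaded successfully"
  end
end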
Alternative 1: have the webserver handle our stream directly
- http://apache.webthing.com/mod_upload/: an Apache module for doing uploads directly in the webserver
- http://www.motionstandingstill.com/nginx-upload-awesomeness/2008-08-13/: an nginx module for doing uploads
Alternative 2: use a raw HTTP server, plain sockets to implement the webserver: http://lxscmn.com/tblog/?p=25
Alternative 3: use the Apache Commons FileUpload component in Java
This component is exactly what we need in rails/ruby. http://commons.apache.org/fileupload/
For now, this is what we will use: it has a streaming API for both the incoming request AND the multipart parts!
Read more at https://www.jedi.be/blog/2009/04/10/java-servlets-and-large-large-file-uploads-enter-apache-fileupload/