Reading In File Data One Line At A Time Using ColdFusion's CFLoop Tag Or Java's LineNumberReader

By Ben Nadel

Published 2010-09-15 in ColdFusion — Comments (21)

Last week on Twitter, someone asked about reading in files that were too big to fit in the RAM allocated to the JVM. I suggested that the developer try the file line reader functionality built into ColdFusion 8's CFLoop tag. After this discussion ended, someone else asked me to blog about this new CFLoop functionality, as they had never heard of it before. As such, I figured I'd put together this quick ColdFusion demo.

As of ColdFusion 8, there are two new CFLoop attributes related to file parsing:

File - The expanded path of the file to read.
Characters - The number of characters to read from the file with each iteration.

While the File attribute is required for file reading, the Characters attribute is not. If the Characters attribute is omitted, ColdFusion defaults to reading in the file one line at a time (as defined by standard line delimiters - \r, \n, and \r\n). In this case (characters omitted), the Index variable of the loop will contain the line data, minus the line delimiters. If the Characters attribute is provided, the Index variable of the loop will contain a chunk of up to that many characters (including any line delimiters); the last chunk may be shorter.

To see both of these scenarios in action (by-line and by-characters), let's take a look at the following ColdFusion demo:

<!---
	We are going to be reading in a file, line by line, so first,
	let's create a file to read. Define the path to the file we
	are going to populate.
--->
<cfset filePath = expandPath( "./data.txt" ) />

<!---
	Delete the file if it exists so that we don't keep populating
	the same document.
--->
<cfif fileExists( filePath )>

	<cfset fileDelete( filePath ) />

</cfif>

<!--- Write some data to the file. --->
<cfloop
	index="i"
	from="1"
	to="10"
	step="1">

	<cffile
		action="append"
		file="#filePath#"
		output="This is line #i# in this file."
		addnewline="true"
		/>

</cfloop>


<!--- ----------------------------------------------------- --->
<!--- ----------------------------------------------------- --->


<cfoutput>


	<!---
		Now, we are going to read the file in line-by-line using
		ColdFusion 8's new CFLoop behavior. The File attribute
		tells ColdFusion what file to read in, the Index attribute
		defines the variable into which ColdFusion will put the
		parsed text line.
	--->
	<cfloop
		index="line"
		file="#filePath#">

		Line: #line#<br />

	</cfloop>


	<br />


	<!---
		CFLoop also allows for a Characters attribute. If we omit
		this attribute (as above), ColdFusion reads the file line-by-
		line. If we use the Characters attribute, however, ColdFusion
		will read the file a chunk at a time based on the number of
		characters defined.

		Here, we are going to read the file in 50 characters at
		a time.
	--->
	<cfloop
		index="chunk"
		file="#filePath#"
		characters="50">

		50 Char Chunk: #chunk#<br />

	</cfloop>


</cfoutput>

The first part of this demo simply creates and populates the file that we are going to read in. Then, I use two CFLoop tags - one with just the File attribute and one with both the File and Characters attributes. When we run the above code, we get the following page output:

Line: This is line 1 in this file.
Line: This is line 2 in this file.
Line: This is line 3 in this file.
Line: This is line 4 in this file.
Line: This is line 5 in this file.
Line: This is line 6 in this file.
Line: This is line 7 in this file.
Line: This is line 8 in this file.
Line: This is line 9 in this file.
Line: This is line 10 in this file.

50 Char Chunk: This is line 1 in this file. This is line 2 in thi
50 Char Chunk: s file. This is line 3 in this file. This is line
50 Char Chunk: 4 in this file. This is line 5 in this file. This
50 Char Chunk: is line 6 in this file. This is line 7 in this fil
50 Char Chunk: e. This is line 8 in this file. This is line 9 in
50 Char Chunk: this file. This is line 10 in this file.

As you can see, when we provide the File attribute but omit the Characters attribute, ColdFusion will read the file in one line at a time. When we include the Characters attribute, ColdFusion will read the file in one chunk of characters at a time (50 characters per chunk in this demo, with a shorter final chunk).

NOTE: While it is not represented in the rendered output, by-line reading does not include line delimiters; by-characters reading, on the other hand, does include line delimiters.
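To make that delimiter difference visible outside of the rendered HTML, here is a minimal sketch in plain Java. It is not a claim about what CFLoop does internally; the class name DelimiterDemo and the byLine() and byChunk() helpers are made up for illustration, and it reads from an in-memory string (rather than a file) so that it can run on its own.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class - not part of ColdFusion or the original demo.
public class DelimiterDemo {

	// By-line reading: readLine() drops the trailing \r, \n, or \r\n.
	public static List<String> byLine( String text ) {
		List<String> lines = new ArrayList<>();
		try ( BufferedReader reader = new BufferedReader( new StringReader( text ) ) ) {
			String line;
			// readLine() returns null once the input is exhausted.
			while ( ( line = reader.readLine() ) != null ) {
				lines.add( line );
			}
		} catch ( IOException e ) {
			throw new UncheckedIOException( e );
		}
		return lines;
	}

	// By-characters reading: delimiters are just characters, so they stay.
	public static List<String> byChunk( String text, int size ) {
		List<String> chunks = new ArrayList<>();
		char[] buffer = new char[ size ];
		try ( BufferedReader reader = new BufferedReader( new StringReader( text ) ) ) {
			int count;
			// read() returns the number of characters read, or -1 at the end.
			while ( ( count = reader.read( buffer, 0, size ) ) != -1 ) {
				chunks.add( new String( buffer, 0, count ) );
			}
		} catch ( IOException e ) {
			throw new UncheckedIOException( e );
		}
		return chunks;
	}

	public static void main( String[] args ) {
		String text = "one\r\ntwo\nthree";
		System.out.println( byLine( text ) );
		System.out.println( byChunk( text, 5 ).get( 0 ).endsWith( "\r\n" ) );
	}

}
```

With the sample string above, byLine() gives back the three words with no delimiters attached, while the first 5-character chunk from byChunk() still ends with the \r\n pair.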

It's awesome how easy ColdFusion makes some of this functionality. And, while I can't be sure, I would guess that ColdFusion is using Java's LineNumberReader under the covers. The LineNumberReader class provides both by-line and by-characters parsing which makes it ideal for this new combination of CFLoop attributes.

If you are not using ColdFusion 8+ yet, you can still get this kind of functionality by dipping down into the Java layer and invoking the LineNumberReader class directly. ColdFusion provides a clean, simple abstraction for this functionality, so you'll see that using the LineNumberReader directly is quite a bit more complicated.
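For reference, this is roughly what the by-line read looks like in plain Java, before any ColdFusion wrapping. Treat it as a sketch under a few assumptions: the class name LineByLine and the numberedLines() and writeSample() helpers are hypothetical, the sample file is written to a temp location, and FileReader is left on the platform's default character encoding.

```java
import java.io.FileReader;
import java.io.IOException;
import java.io.LineNumberReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class - the names here are for illustration only.
public class LineByLine {

	// Writes "This is line N in this file." lines to a temp file and
	// returns the path (mirrors the data file used in the demos).
	public static String writeSample( int lineCount ) {
		try {
			Path path = Files.createTempFile( "data", ".txt" );
			List<String> lines = new ArrayList<>();
			for ( int i = 1 ; i <= lineCount ; i++ ) {
				lines.add( "This is line " + i + " in this file." );
			}
			Files.write( path, lines, StandardCharsets.UTF_8 );
			return path.toString();
		} catch ( IOException e ) {
			throw new UncheckedIOException( e );
		}
	}

	// Reads the file one line at a time. The returned list exists only so
	// the demo has something to print; the reader itself holds one line.
	public static List<String> numberedLines( String filePath ) {
		List<String> result = new ArrayList<>();
		// LineNumberReader extends BufferedReader, so no extra wrapper is
		// needed here. try-with-resources closes the reader for us.
		try ( LineNumberReader reader = new LineNumberReader( new FileReader( filePath ) ) ) {
			String line;
			// readLine() returns null at the end of the file.
			while ( ( line = reader.readLine() ) != null ) {
				// getLineNumber() is the count of lines read so far.
				result.add( reader.getLineNumber() + ": " + line );
			}
		} catch ( IOException e ) {
			throw new UncheckedIOException( e );
		}
		return result;
	}

	public static void main( String[] args ) {
		for ( String entry : numberedLines( writeSample( 10 ) ) ) {
			System.out.println( entry );
		}
	}

}
```

The null check on readLine() is the same end-of-file signal that the ColdFusion version has to detect indirectly, since a Java null assigned to a ColdFusion variable removes that variable.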

In the following demo, I am going to replicate the previous CFLoop output using the LineNumberReader class:

<!---
	We are going to be reading in a file, line by line, so first,
	let's create a file to read. Define the path to the file we
	are going to populate.
--->
<cfset filePath = expandPath( "./data.txt" ) />

<!---
	Delete the file if it exists so that we don't keep populating
	the same document.
--->
<cfif fileExists( filePath )>

	<cfset fileDelete( filePath ) />

</cfif>

<!--- Write some data to the file. --->
<cfloop
	index="i"
	from="1"
	to="10"
	step="1">

	<cffile
		action="append"
		file="#filePath#"
		output="This is line #i# in this file."
		addnewline="true"
		/>

</cfloop>


<!--- ----------------------------------------------------- --->
<!--- ----------------------------------------------------- --->


<!---
	If you are not on ColdFusion 8 yet, you can still read files
	in a line at a time by dipping down into the Java layer.
	Behind the scenes, ColdFusion is probably using some sort of
	buffered file reader, so we can do the same explicitly.

	To create the line number reader, we have to pass it a Reader
	object, which will be a buffered reader for performance reasons.
--->
<cfset lineReader = createObject( "java", "java.io.LineNumberReader" ).init(
	createObject( "java", "java.io.BufferedReader" ).init(
		createObject( "java", "java.io.FileReader" ).init(
			javaCast( "string", filePath )
			)
		)
	) />

<!---
	Mark the beginning of the stream so we can reset the position
	of the reader if we need to.

	NOTE: You typically won't need this - I just need to do this so
	I can demonstrate two file reads without creating a new line
	number reader object.
--->
<cfset lineReader.mark(
	javaCast( "int", 999999 )
	) />


<cfoutput>


	<!---
		Now, let's read the file in a line at a time. As we use the
		readLine(), it will return a NULL when it gets to the end of
		the file. When that happens, the variable we are using to
		read the line will be deleted.
	--->
	<cfset line = lineReader.readLine() />

	<!---
		Check to make sure we didn't hit the end of the file (which
		will return NULL, which will delete our variable).
	--->
	<cfloop condition="structKeyExists( variables, 'line' )">

		Line: #line#<br />

		<!--- Read the next line. --->
		<cfset line = lineReader.readLine() />

	</cfloop>


	<br />


	<!---
		Reset the line number reader to the beginning of input
		stream for next demo.
	--->
	<cfset lineReader.reset() />

	<!---
		We can also use the buffered line reader to read in chunks
		of the file as we did with the CFLoop tag. This is a bit
		more complicated as we need to read the character data into
		a character array.
	--->

	<!---
		Create a character array of length 50 for our read buffer
		(we will be reading in a max of 50 characters at any time).

		NOTE: It doesn't matter what the initial values are at this
		point since our line number reader will overwrite the data.
	--->
	<cfset buffer = listToArray( repeatString( " ,", 50 ) ) />

	<!---
		Cast the ColdFusion array (collection) to a typed Java array
		so that we can use it with the line number reader.
	--->
	<cfset buffer = javaCast( "char[]", buffer ) />


	<!---
		Read the file data into the buffer and record the number
		of characters that were read.
	--->
	<cfset charCount = lineReader.read(
		buffer,
		javaCast( "int", 0 ),
		javaCast( "int", arrayLen( buffer ) )
		) />

	<!---
		Keep looping while characters were read-in. When the line
		reader hits the end of the file, it will return -1 for the
		character count.
	--->
	<cfloop condition="(charCount neq -1)">

		<!---
			Output the chunk. When we do this, we want to convert
			the buffer to a string and then just take out what's
			needed.
		--->
		<cfset chunk = mid(
			arrayToList( buffer, "" ),
			1,
			charCount
			) />

		50 Char Chunk: #chunk#<br />

		<!---
			Read the next chunk of character data from the file
			into the buffer and record the number of characters
			that were read.
		--->
		<cfset charCount = lineReader.read(
			buffer,
			javaCast( "int", 0 ),
			javaCast( "int", arrayLen( buffer ) )
			) />

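	</cfloop>

	<!---
		Close the reader to release the underlying file handle
		(closing the outer reader also closes the wrapped readers).
	--->
	<cfset lineReader.close() />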


</cfoutput>

Again, the first part of the demo simply creates and populates the file that we are going to be reading. Once that is done, I then use the readLine() method for by-line parsing and the read() method for by-characters parsing. While the readLine() method is fairly straightforward, the read() method requires us to use a strongly-typed character array buffer which, as you can see, greatly increases the complexity of the code.

When we run the above ColdFusion and Java code, we get the following page output:

Line: This is line 1 in this file.
Line: This is line 2 in this file.
Line: This is line 3 in this file.
Line: This is line 4 in this file.
Line: This is line 5 in this file.
Line: This is line 6 in this file.
Line: This is line 7 in this file.
Line: This is line 8 in this file.
Line: This is line 9 in this file.
Line: This is line 10 in this file.

50 Char Chunk: This is line 1 in this file. This is line 2 in thi
50 Char Chunk: s file. This is line 3 in this file. This is line
50 Char Chunk: 4 in this file. This is line 5 in this file. This
50 Char Chunk: is line 6 in this file. This is line 7 in this fil
50 Char Chunk: e. This is line 8 in this file. This is line 9 in
50 Char Chunk: this file. This is line 10 in this file.

As you can see, this output is exactly the same as the output generated by the CFLoop-only demonstration.

In the above code, you'll notice that the LineNumberReader wraps a BufferedReader instance (LineNumberReader is itself a subclass of BufferedReader, so the extra wrapper is optional). It's the buffering that really makes this approach (and most likely the CFLoop approach) so efficient. I don't want to talk too much about how buffered readers work, as I'm not really a Java developer; but, they read character data from disk in larger blocks and hand it out a piece at a time, which minimizes disk I/O while keeping overall memory consumption low.
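As a small illustration of the memory side of that, here is a sketch that walks an entire input while only ever holding one line plus the reader's internal buffer. The class name BufferedCount, the countLinesAndChars() helper, and the buffer size are all assumptions for the demo; the two-argument LineNumberReader constructor that takes a buffer size is part of the standard JDK.

```java
import java.io.IOException;
import java.io.LineNumberReader;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical demo class - the names here are for illustration only.
public class BufferedCount {

	// Returns { lineCount, characterCount } for the given input, where the
	// character count excludes line delimiters. Only the current line and
	// the reader's buffer (bufferSize characters) are in memory at once.
	public static long[] countLinesAndChars( Reader source, int bufferSize ) {
		try ( LineNumberReader reader = new LineNumberReader( source, bufferSize ) ) {
			long chars = 0;
			String line;
			while ( ( line = reader.readLine() ) != null ) {
				chars += line.length();
			}
			// getLineNumber() now holds the total number of lines read.
			return new long[] { reader.getLineNumber(), chars };
		} catch ( IOException e ) {
			throw new UncheckedIOException( e );
		}
	}

	public static void main( String[] args ) {
		// A StringReader stands in for a FileReader to keep this runnable.
		long[] totals = countLinesAndChars( new StringReader( "ab\ncde\n" ), 16 );
		System.out.println( totals[ 0 ] + " lines, " + totals[ 1 ] + " characters" );
	}

}
```

Swapping the StringReader for a FileReader gives the same loop over a file of any size; the memory footprint is set by the buffer and the longest line, not by the size of the file.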

The CFLoop tag is really one of the most amazing tags in ColdFusion. Between for-loops, query-loops, array-loops, list-loops, file-loops, and conditional-loops there's very little that the CFLoop tag can't do. It makes looping so easy, in fact, that you probably never even think about how much work this ColdFusion tag is actually abstracting. It's like they say - Great design should be invisible. Anyway, I hope this helps clarify how this part of the CFLoop tag works.

Short link: https://bennadel.com/2011

Reader Comments

Khoa Sep 15, 2010 at 12:04 PM

Hey Ben, I just got a quick thought while reading your post, what about reading the whole file into memory (using cffile action=read) and then we use list or maybe better regular expression to split the content string into chunks that we need. For the first case, they can be split into list items using the newline as delimiter. For the second case, we can use string manipulation to get 50 chars at a time (or regexp somehow, gotta think more)

Do you think that will works? Or it would be a lot slower? Like i said, i just thought of it, haven't tried it yet. well, lying on my bed with my ipad right now, so cant test it out :-)

Peter Boughton Sep 15, 2010 at 12:25 PM

The problem with "reading the whole file into memory" is that you're reading the whole file into memory.

Depending on the file you're working with, this might be a difference between occupying 50 bytes of memory or occupying 5,000,000 bytes of memory.

Peter Boughton Sep 15, 2010 at 12:26 PM

(As the opening line of this blog post says: "someone asked about reading in files that were too big to fit in the allocated RAM on the JVM")

Ben Nadel Sep 15, 2010 at 1:00 PM

@Vinh,

If you can read the file into memory at one time, it will definitely be faster - the buffering is going to add overhead to the performance. I have definitely seen people deal with CSV files in this way - reading in the file, then converting it to an array using listToArray() in which the line delimiters are used as list-delimiters.

If you cannot read a file into memory, however, the buffered reader approach is going to be slower, but critical for overall performance (ie. not eating up all your RAM and causing overflow problems).

@Peter,

This just made me laugh out loud:

>> "The problem with "reading the whole file into memory" is that you're reading the whole file into memory."

Ha ha :)

Matthew Abbott Sep 15, 2010 at 1:55 PM

Ben,

How would you then extract just a portion of the content in a file, and then do a line by line read on it?

Im having to kind of do what Vinh is talking about. Extracting a portion, without reading in the whole file.

Henry Ho Sep 15, 2010 at 2:22 PM

I thought you'd include a benchmark between cfloop and Java's lineReader. :)

Khoa Sep 15, 2010 at 6:16 PM

@Peter, @Ben hahaha, sorry, stupid me, look like I missed the point of the article :-) I always read the whole file in. I think my applications are just not big enough that I encounter that overflow scenario :-p

Thanks for the article Ben. Next time my app hangs when it reads a file, I know why and will think of you :-p

Robert Haddan Sep 15, 2010 at 6:20 PM

I've been working with this in a Java application this week. For working with large XML files, check out the XML Pull Parser project: http://www.extreme.indiana.edu/xgws/xsoap/xpp/

I haven't worked with this in Coldfusion, but it encapsulates streaming large XML files, provides methods for XPath... the JVM usage went from serious heap overflows to just a blip on the memory.

Ben Nadel Sep 15, 2010 at 6:35 PM

@Matthew,

I suppose you could just call readLine() a given number of times until you get to portion you want. How do you identify the portion you are targeting?

@Henry,

Yeah, good point. My server was hemming and hawing this morning so I didn't have a lot of time to do the most in-depth exploration.

@Vinh,

No problem at all :)

@Robert,

Looks very interesting. I've played around a bit with large XML files and I know there are a number of event-driven parses; I've not had a great success with using them inside of ColdFusion. I'll take a look at this one as it seems to be doing something a bit different.

Matthew Abbott Sep 15, 2010 at 8:56 PM

@Ben,

currently im reading in the file and just doing an indexOf or reFind to get start and end positions but i then use mid() to get the portion.

I could do the same going through line by line and when it finds x to start getting the data, and when it finds y it stops. ill try it out and see if its any faster.

Khoa Sep 15, 2010 at 9:27 PM

@Matthew,

If I understand your method correctly, you use indexOf() to find the position of the newline character (or whatever char), mark it as the start, then find the next position of the same char, mark it as end and then use mid() to get the portion. Is that correct? If so, why not use list with the newline (or whatever char) as delimiter?

Ben Nadel Sep 15, 2010 at 11:01 PM

@Matthew,

It's probably gonna be faster just using the Mid() approach since you're only doing the search once.

@Vinh,

You make a good point. Converting the list to an array using line delimiters and then using the array index is probably gonna be quite fast.

Gareth Arch Sep 16, 2010 at 10:58 AM

I had this problem last week (reading in a large file) and found the cfloop solution to work nicely. I ran into a timeout issue then though with the cfloop. I managed to solve this by adding my own timer to the page that checks how long the import has been running (read in a line, quick time check, read in a line, time check, etc). Once it hit a preset time, it breaks out of the loop, pushes the user back to the "import" page, I run a javascript redirect (thus CF is now taking a break :) ) which pushes the user back to the CF page to continue the import at whatever line the import left off at. I set the CF Admin timeout to 10 seconds, and my time check to 8 seconds and it was still running (correctly) after 15 minutes. I just make sure to show the user a message to let them know what is happening, but it was a quick way to get around the timeout issues.

Ben Nadel Sep 16, 2010 at 11:23 AM

@Gareth,

Wow - checking the execution time throughout the page - there's something kind of brilliant about that. I never thought of doing that.

Gareth Arch Sep 16, 2010 at 4:43 PM

@Ben,
Thanks! It came down to necessity at the time. It's crazy what you come up with when it's late, you're running out of time and need to have a solution :) I had tried at first to use the method you described in a blog post a while back about extending the timeout so you can perform different tasks, but depending on the file size, this didn't seem feasible (and I'm sure would've caused fits on the server). The only way I could see getting around the timeout was to leave the CF execution and pass the reins back to the browser, then start the CF process up again. As long as you know the timeout in your administrator and set the timer to less than that value, it should run nicely.

Ben Nadel Sep 17, 2010 at 10:54 PM

@Gareth,

Yeah, I've had trouble with my previous suggestion as well. In fact, when I recently changed it, I started getting a *ton* of error emails. At first, I thought maybe I was just getting new errors; what I think was happening though, was that my previous concept was just failing and the emails were never getting sent.

I like your style, good sir.

Sami Hoda Sep 18, 2010 at 12:26 PM

If you are looking for specific line numbers, you could use this CFC I wrote a while back:
http://filebyline.riaforge.org/

Its pretty flexible - some methods: getLine, setLine, insertLine, deleteLine...

Raheman Nov 30, 2012 at 8:42 AM

In a file contains the data as |(pipe) delimiter format and the file contains 50 rows. In that file each line 17 th | delimeter data I want, and How to change that 17th column data in same line and how to store that data in to same line in the same file.

can you pls post the sample codes for the above

Raheman Dec 3, 2012 at 3:06 AM

how to reads one line from an input file, does some processing on it, and writes the resultant data to the same line in the same file.
suppose the line contains the data in delimiter format,I want to change the data in 17 th delimiter then how to add the modified data in the same location(17 th delimiter) in the same line of the same file in cold fusion.

Lars Lodewyks Jun 26, 2013 at 4:37 AM

Thanks, the blog post was very useful! I had to analyze very large files. I couldn't read them as a whole because of their size.
But you probably should consider closing the readers in the end. ;) (CF < 8 solution)

Leo Mar 11, 2014 at 6:20 PM

Thanks Ben for you very interesting articles on handling large files. As I am not a programmer, I am quite insecure whether my issue is directly related to your description or not.
My problem is, that I've got a PDF file generated by Adobe Coldfusion which contains several languages. In fact we always only display one of these languages at the same time, but with each language we add to the PDF it takes longer to open (> 30 sec).
According to my interpretation of the underlying XML structure the file should be parsed quite quickly and only the lines of the active language have to be considered, but somehow we don't find a solution for that.
Can you give us a hint how to solve this issue?
