Using FileReadLine() With Seekable Files In ColdFusion
Last week, I started to explore seekable files in ColdFusion. A seekable file allows us to jump to an arbitrary offset within the file contents (which I believe can be done without having to read the entire file into memory). I've recently been dealing with consuming large text files at work; and, I'm wondering if a seekable file might be something I can use to create a "resumable" consumption process. As such, I wanted to play around with using the fileReadLine() function in conjunction with seekable files in ColdFusion.
At work, we use Docker and Kubernetes on top of AWS (Amazon Web Services). Which means, at any moment - for a variety of reasons - the current node might be spun-down and removed with little warning. Which means, any long-running process must have a way to track its current progress; and, when necessary, pick up where it left off.
When I'm consuming data in a database, this usually means storing some primary-key cursor. But, when consuming data in a large text file (think a CSV or Newline-Delimited JSON file), I've historically treated this as an all-or-nothing task. I'm thinking now, however, that with a seekable file, I can store the character offset as a way to create a resumable file consumption process.
In order to actually get this to work, I would need to be able to persist two things out-of-band:
- The file being consumed.
- The current character-offset within the file.
For the sake of this exploration, I'm not going to worry about those two points (as that adds quite a bit of complexity). Instead, I'm just going to iterate over a file:
- Open the file.
- Seek to the next offset.
- Read the next line of text.
- Close the file.
I'll keep opening, closing, and reading from the file until I hit the end-of-file (EOF) state. This way, I can look at how I can incrementally read file data without needing to figure out how I might actually persist the state across requests.
In the following ColdFusion code, I'm going to read-in a poem and echo it out line-by-line. Across each iteration of the while() loop, I'm going to maintain an offset value. This is the point within the file to which I will seek the next time I open the file:
<cfscript>
	offset = 0;
	lineNumber = 0;
	CARRIAGE_RETURN = chr( 13 );
	NEWLINE = chr( 10 );

	// NOTE: It's silly and unnecessary to open a file, read a line, and then close the
	// file. I'm only using this approach because I'm exploring the interplay between
	// seekable files and the fileReadLine() function.
	while ( true ) {

		// Open the file in SEEK mode. This allows us to jump to an arbitrary point within
		// the file.
		poem = fileOpen( "./data.txt", "read", "utf-8", true ); // true = seekable

		try {

			fileSeek( poem, offset );

			// If we have seeked past the end of the file, the file will now report as EOF
			// (end-of-file); and, we know that we are done with our data consumption.
			if ( fileIsEof( poem ) ) {
				break;
			}

			// When we read the next line of data, it will do so from the SEEK OFFSET that
			// we just jumped to.
			line = fileReadLine( poem );
			lineNumber++;

			echo( "(#lineNumber#) #line# <br />" );

			// The fileReadLine() function reads UP TO but NOT INCLUDING the next line-
			// delimiter. If the file is using a Unix-based delimiter, it's 1-character
			// (the newline). But, if the file is using a Windows-based line-delimiter,
			// it's going to be 2-characters (carriage return + newline). We can't know
			// what it's going to be ahead of time, so we have to test the next character
			// to see which it is.
			offset += line.len();

			// Move to the end of the line we just consumed.
			fileSeek( poem, offset );

			// If we're not at the end of the file, our next goal is to pick an offset
			// that moves PAST the line-delimiter.
			if ( ! fileIsEof( poem ) ) {

				c = fileRead( poem, 1 );

				// If the next character is the carriage return, we're going to assume
				// that we have a two-character line-delimiter.
				if ( c == CARRIAGE_RETURN ) {
					offset += 2;
				}

				// If the next character is the newline, we're going to assume that we
				// have a 1-character line-delimiter.
				if ( c == NEWLINE ) {
					offset += 1;
				}

			}

		} finally {
			fileClose( poem );
		}

	}
</cfscript>
As you can see, every time I open the file, I call fileSeek() to move to the previously-calculated offset. And, as long as we're not at the end-of-file, we read in the next line using fileReadLine(). The big complication here is that fileReadLine() does not include the line-delimiter in the value it returns. Which means, we have to inspect the data following the line in order to figure out how many characters we need to move ahead for the next seek offset.
That said, when we run this ColdFusion code, we get the following output:
(1) Out of the night that covers me,
(2) Black as the pit from pole to pole,
(3) I thank whatever gods may be
(4) For my unconquerable soul.
(5)
(6) In the fell clutch of circumstance
(7) I have not winced nor cried aloud.
(8) Under the bludgeonings of chance
(9) My head is bloody, but unbowed.
(10)
(11) Beyond this place of wrath and tears
(12) Looms but the Horror of the shade,
(13) And yet the menace of the years
(14) Finds and shall find me unafraid.
(15)
(16) It matters not how strait the gate,
(17) How charged with punishments the scroll,
(18) I am the master of my fate,
(19) I am the captain of my soul.
As you can see, we were able to seek-to and read one line at a time in between opening and closing the file.
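The delimiter check in the loop above assumes that a carriage return is always followed by a newline. A slightly more defensive take might look like the following sketch. The getDelimiterLength() function and its argument names are hypothetical (they are not part of ColdFusion or of the demo), and the sketch assumes that fileIsEof() reports correctly after a fileRead() on a seekable file, just as it does after a fileSeek():

```cfml
<cfscript>

	/**
	* Hypothetical helper: returns the number of characters (0, 1, or 2) taken up
	* by the line-delimiter that starts at the given offset.
	*/
	function getDelimiterLength( required any fileHandle, required numeric offset ) {

		fileSeek( fileHandle, offset );

		// Nothing follows the line (the last line has no trailing delimiter).
		if ( fileIsEof( fileHandle ) ) {
			return 0;
		}

		var first = fileRead( fileHandle, 1 );

		// Unix-style delimiter: a lone newline.
		if ( first == chr( 10 ) ) {
			return 1;
		}

		// Anything other than a carriage return means we're not on a delimiter.
		if ( first != chr( 13 ) ) {
			return 0;
		}

		// We have a carriage return. If it's the last character in the file, it
		// stands alone.
		if ( fileIsEof( fileHandle ) ) {
			return 1;
		}

		// Only count a 2-character delimiter if a newline actually follows.
		return ( fileRead( fileHandle, 1 ) == chr( 10 ) ) ? 2 : 1;

	}

</cfscript>
```

With something like this in place, the bottom half of the loop could shrink down to two statements: offset += line.len() followed by offset += getDelimiterLength( poem, offset ).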
Seeing this in action, I'm able to imagine a scenario in which I store a temporary CSV / NDJSON file somewhere, persist the next offset, and then parse the file one line at a time, resuming from that offset if the current process gets cut short for some reason. The devil is in the details; but, having a seekable file certainly makes this possible.
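As a rough sketch of what that persistence might look like, the following keeps the next offset in a small JSON file that sits next to the data file. Both file paths, and the shape of the state struct, are made up for illustration. Like the demo, it assumes that line.len() and fileSeek() are counting in the same units, which holds for single-byte content but is worth verifying for multi-byte UTF-8 data:

```cfml
<cfscript>

	// Hypothetical paths: the state file is not part of the original demo.
	dataPath = expandPath( "./data.txt" );
	statePath = expandPath( "./data.offset.json" );

	// Resume from the persisted offset if a previous run left one behind.
	offset = 0;

	if ( fileExists( statePath ) ) {
		state = deserializeJson( fileRead( statePath ) );
		offset = state.offset;
	}

	dataFile = fileOpen( dataPath, "read", "utf-8", true ); // true = seekable

	try {

		fileSeek( dataFile, offset );

		while ( ! fileIsEof( dataFile ) ) {

			line = fileReadLine( dataFile );

			// ... application-specific processing of the line goes here ...

			// Move past the line and its delimiter, using the same logic as the
			// demo above.
			offset += line.len();
			fileSeek( dataFile, offset );

			if ( ! fileIsEof( dataFile ) ) {

				c = fileRead( dataFile, 1 );

				if ( c == chr( 13 ) ) {
					offset += 2;
				} else if ( c == chr( 10 ) ) {
					offset += 1;
				}

			}

			// Persist the progress only AFTER the line has been processed.
			state = { offset: offset };
			fileWrite( statePath, serializeJson( state ) );

			// Jump to the start of the next line before reading again.
			fileSeek( dataFile, offset );

		}

	} finally {
		fileClose( dataFile );
	}

</cfscript>
```

Since the offset is only written after a line has been handled, a process that dies mid-line will re-read that same line on the next run. Which means, the line processing itself would need to be safe to repeat.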