• Welcome to Valhalla Legends Archive.
 

Allocating memory

Started by programming-design, October 11, 2005, 06:35 AM

Previous topic - Next topic

programming-design

Hello all,

I have a question regarding allocating memory in java. I am working on a base64 encoder/decoder in which I have a readFile function that returns a byte array of the entire files data. The problem lies when I am encoding/decoding large files (100 MB) how do I allocate a byte array to that size without getting an "out of memory" error?

Earlier I was looking for a method to allocate more space to a byte array and keep the data inside the array preserved without storing the data in a temporary buffer and just allocate while I'm reading from the buffer. Storing the data in a temp. buffer with a loop of say 10,000 to retrieve more data is hardly memory-saving code.

I looked into storing the buffers into a vector, but the problem then lies with accessing the data as a full byte array. I did manage to convert the vector into a "Byte" (uppercase B) array (rather than byte) but I couldn't find a way to convert that into a "byte" array. The reason for needing a final byte array is for outputting the new file data.

Anyhow, I'm pretty sure if I re-wrote all the functions to work with Vectors containing all the buffers it would work but again I'm not 100% sure and would be a shame doing all that re-writing to find out it doesn't work.

Basically I tried many different methods, and I haven't found a way to make this work. I was hoping one of you guys would be able to help?

gameschild

Firstly i would ask why are you storing that much in memory rather than processing a steam or something?

Secondly yes you could use a Vector to store multiple byte[]'s of say 50MB, must there is probably a better way to store if, such as a ByteBuffer maybe?

Dynobird

#2
You should read up on data structures. They're fun and interesting. But here's the answer to your question... And some classes you can use to create your data structures (below).

About converting a Byte into a byte array... you can't. A Byte is just an object that can hold a byte (primitive data type) and do things with it. You can make a Byte[] array however, and at the very last instant that you need the byte values you iterate through the array sending Byte.byteValue();

Arrays have a set size, yes. But there are classes out there that have ways of storing data getting around the array restriction. There are many ways of doing this. One is by the linked list method of doing things, by avoiding arrays completely, and another is by making a class that makes a bigger array when the time comes and copies the values of the smaller array into this larger one.

Linked lists don't even use arrays. Basically what linked lists are is a bunch of Nodes. You make a Node class that stores data for something, as well as a pointer to another Node. This pointer is simply a reference to another Node instance. The Node class has an accessor (get) method that returns the next Node, as well as a mutator method that sets this next Node (you put these arguments in the constructor too). The same works for the data/fields for each Node. You may be wondering: do I have to create a new Node instance for every new data item that I want to make? The answer is no, that would be too much work =p. What you do is make a Linked List class that has three Node fields: a head Node, a tail Node, and a temporary Node. This class has a mutator method which you give a data item and then it resets (erasing in the process) the value of the temporary Node. If head has no data item in it, then you make head reference the temp Node. If it does have it, then you have tail set it's next node to the temp node, and update tail by making it reference it's next node. Now, these are references - they don't erase the object that tail used to reference. After you're done adding in data items, you make a new Node (for clarity's sake) called Cursor and assign it to head. You do this so that after your method does its job head is preserved at the start of the linked list. So then you "curse" through your nodes by doing a Cursor.getValue() to return the value and perhaps print it out and then a Cursor = Cursor.getNext() to assign cursor to the next node. Then you repeat the process. Sorry if this is all confusing the way I'm writing it I'll post my own LinkedList and Node classes below.

One thing to know before you look at the classes. The Nodes end up taking Objects for values. Why? So you can add any kind of data item to your linked list. All objects extend Object, so all objects are kinds of Objects, so that's why this is possible. When you need your objects in their original form, you cast them back into their original form using "prefix-like" parentheses, like this:

Node someNode = new Node(String "string", null, null);
String nodeStr = (String)someNode.getValue();


This casts what is returned by getValue() into a String. The parentheses must be placed logically. For instance, If I wanted to call a method of the String object, then I would have to do this:

Node someNode = new Node(String "this is a string", null, null);
byte[] strToBytes = ((String)someNode.getValue()).getBytes();

This returns an Object, casts it into a String, and THEN makes a byte array out of it.


/** The node class. Each Node stores a data item and two pointers - one to the next and one to the last.
*/
public class Node {
   
    private Object value;        // data item
    private Node next;           // pointer
    private Node last;            // pointer
   
    /** I give my Node two pointers - one to the next Node in the list and one to the last
    */
    public Node(Object initValue, Node initNext, Node initLast) {
        value = initValue;
        next = initNext;
        last = initLast;
    }
   
    public void setValue(Object newValue) {
        value = newValue;
    }
   
    public void setNext(Node newNext) {
        next = newNext;
    }
   
    public void setLast(Node newLast) {
        last = newLast;
    }
   
    public Object getValue() {
        return value;
    }
   
    public Node getNext() {
        return next;
    }
   
    public Node getLast() {
        return last;
    }
}


/** Linked List class that maintains the list of nodes
*/
public class LinkedList {
   
    private Node head = null, tail = null;
    final String INDEX_OUT_OF_BOUNDS = "Index is not within the bounds of the Linked List. The Object returned is null.";
    final int EMPTY = 0;

    public LinkedList() {}
   
    public LinkedList(Object[] data) {
        Node temp;
        for (int i = 0; i < data.length; i++) {
            temp = new Node(data[i], null, null);
            if (head == null)
                head = temp;
            else
                tail.setNext(temp);
            tail = temp;
        }
    }
   
    public void add(Object value) {
        Node temp = new Node(value, null, null);
        if (head == null)
            head = temp;
        else
            tail.setNext(temp);
        tail = temp;
    }
   
    public void addFirst(Object value) {
        Node temp = new Node(value, head, null);
        head = temp;
    }
   
    public Object get(int index) {
        Node temp = head;
        if (index < 0 || index > size()) {
            System.out.println(INDEX_OUT_OF_BOUNDS);
            return null;
        }
        else {
            for (int i = 0; i < index; i++) {
                temp = temp.getNext();
            }
            return temp.getValue();
        }
    }
   
    public boolean isEmpty() {
        if (size() == EMPTY)
            return true;
        return false;
    }
   
    public int size() {
        Node temp = head;
        int count = 1;
        if (temp == null)
            return 0;
        else {
            while (temp.getNext() != null) {
                temp = temp.getNext();
                count++;
            }
            return count;
        }
    }
   
    /** This scrolls to the Node before the one to remove
    * And sets that Node's pointer to the one after the one to remove.
    * Then it sets the one to remove to null, although it's not even necessary
    */
    public void remove(int index) {
        Node temp = head;         
        if (index < 0 || index > size())
            System.out.println(INDEX_OUT_OF_BOUNDS);
        else {
            for (int i = 0; i < index - 1; i++) {
                temp = temp.getNext();
            }
            temp.setNext(temp.getNext().getNext());
        }
    }
   
    public void removeLast() {
        tail = tail.getLast();
        tail.setNext(null);
    }
   
    /** The ListIterator object just provides an easier way to iterate through the list
    */
    public ListIterator listIterator() {
        ListIterator returniter = new ListIterator(this);
        return returniter;
    }
}


/** This class makes it easy to iterate through a Linked List
*/
public class ListIterator {
   
    private LinkedList list;
    int cursor;
   
    public ListIterator(LinkedList list) {
        this.list = list;
    }
   
    public boolean hasNext() {
        if (cursor == list.size())
            return false;
        else {
            return true;
        }
    }
   
    public Object next() {
        Object returnvalue = list.get(cursor);
        cursor++;
        return returnvalue;
    }
   
    public void remove() {
        list.removeLast();     
    }
}


You can use my classes if you want. Sun has a LinkedList class of its own, with more methods. But its useful to see how its done because
you're eventually going to need specialized data structures, and you must understand the nature of how they are created in order to make your own.

I believe that Sun's ArrayList and Vector classes may use arrays and "refresh" the array just by making a larger array and copying the values of the original, smaller array into it. That would be your idea - but in OOP style - masking the complications of a task by performing them in a class which itself is easy-to-use by other objects. This is actually what the LinkedList class does - you could just create the Nodes in a main method or something and scroll through them there but the LinkedList class lets you make new objects of linked lists and makes the programmer's jobs considerably easier by "masking" the work. It's actually a design pattern, which is another thing you may want to read about.

programming-design

Quote from: gameschild on October 11, 2005, 02:08 PM
Firstly i would ask why are you storing that much in memory rather than processing a steam or something?

Secondly yes you could use a Vector to store multiple byte[]'s of say 50MB, must there is probably a better way to store if, such as a ByteBuffer maybe?

I tried a ByteBuffer. You have to allocate the memory before hand which caused me the out-of-memory error. You could also do it by allocating memory a little bit at a time but that would require storing a temporary buffer of the old data so you can re-allocate space then put the data back into the newly allocated array. I tried this but the process was VERY slow when dealing with large files.


Dynobrid,
Thanks for your well thought out reply. I will be checking up on what you said.

Kp

Why're you keeping the entire file in memory?  Allocate a large buffer, read in and encode data up until that buffer's capacity, then write out the buffer and start over.  Java doesn't handle mapping huge objects well.  If you really need to do that, consider memory mapping the file with mmap(2) or CreateFileMapping+MapViewOfFile.
[19:20:23] (BotNet) <[vL]Kp> Any idiot can make a bot with CSB, and many do!

Dynobird

#5
Just use a Vector. It'll do the job. And when you want a byte array, make a get method that returns one, like this:

public byte[] getBytes() {
        byte[] bytearray = new byte[buffer.size()]; // where buffer is your Vector
        for (int i = 0; i < buffer.size(); i++)
            bytearray[i] = ((Byte)buffer.get(i)).byteValue();
        return bytearray;
       
}

MyndFyre

#6
Let's think about how a Vector is allocated, so you know what you're getting into.

Very generally in some pseudo-pseudocode, this is how to implement a Vector:


class Vector
{
  private class ObjectNode
  {
    Object data;
    ObjectNode next;
  }

  ObjectNode head;

  Object getItemAt(int index)
    throws IndexOutOfBoundsException // is this the right exception?
  {
    if (index < 0) throw new IndexOutOfBoundsException("Index must be nonnegative integer.");
    if (head == null) throw new IndexOutOfBoundsException("No items in the vector.");

    ObjectNode cursor = head;
    for (int i = 0; i < index; i++)
    {
      cursor = cursor.next;
      if (cursor == null) throw new IndexOutOfBoundsException("Index beyond the bounds of the Vector.");
    }
    return cursor.data;
  }
}

Note the ObjectNode class.  That will store your Byte object.  Your Byte object is at least 1 byte in memory (wrapping the primitive type byte), although it may be aligned to a 4-byte boundary; for best-case, we'll use a single byte.  The Byte reference is a 4-byte reference in memory accessed by the ObjectNode object (since Byte derives from Object).  Each ObjectNode takes exactly 8 bytes of memory.  If you're going to have an ObjectNode for every byte in a 100MB file, you'll need to have 800MB of memory devoted to ObjectNodes, plus the 100MB of actual bytes you're representing.  So -- you're going to use 900MB of memory to convert a 100MB file?

If that wasn't bad enough, (one of the more experienced software engineers can correct me on this), but seeking operations in this can be O(sum(1..n)).  Unless the Vector maintains state information about the cursor, which is possible, every time the vector has a read operation, it has to go through every item.  If you have a 100mb byte array, accessing the last byte will cause ~100 million comparisons.  For a single 10 byte array, finding item 1, then 2, then 3 (using for (i = 0; i < 10; i++) { Object o = vector.getItemAt(i); }) incurs 55 comparison operations, 55 increment operations in just the getItemAt method alone.

On the other hand, you could use streams, which would be memory-mapped files (essentially).  Let's look at it this way: Base-64 encoding represents 6 bits in a row as an 8-bit ASCII character, according to this table.  6 and 8 share a common multiple of 24, which indicates that 3 bytes would be encoded as 4 ASCII base64 characters.  If you have a 100MB file of base64, you'll need to allocate only 75MB of memory to hold the file.  Specifically, you'll need:

fileSize * 3 / 4 + (fileSize % 4 == 0 ? 0 : 1)

bytes.  You might be able to chop off the extra modulo depending on how you handle extra bits.

You might not even need to allocate the extra memory if you're just reading from a file and writing to another file.  In this case, you have one FileReader and one FileWriter.  Base64 file?  Read in 4 bytes and write out 3.  Regular file?  Read in 3 bytes and write out 4.

Using this method, you only need to allocate ~ 9 bytes at a time for each read/write operation:

private static byte[] Base64EncodingTable =
{
(byte)'A', (byte)'B', (byte)'C', (byte)'D', (byte)'E', (byte)'F',
(byte)'G', (byte)'H', (byte)'I', (byte)'J', (byte)'K', (byte)'L',
(byte)'M', (byte)'N', (byte)'O', (byte)'P', (byte)'Q', (byte)'R',
(byte)'S', (byte)'T', (byte)'U', (byte)'V', (byte)'W', (byte)'X',
(byte)'Y', (byte)'Z', (byte)'a', (byte)'b', (byte)'c', (byte)'d',
(byte)'e', (byte)'f', (byte)'g', (byte)'h', (byte)'i', (byte)'j',
(byte)'k', (byte)'l', (byte)'m', (byte)'n', (byte)'o', (byte)'p',
(byte)'q', (byte)'r', (byte)'s', (byte)'t', (byte)'u', (byte)'v',
(byte)'w', (byte)'x', (byte)'y', (byte)'z', (byte)'0', (byte)'1',
(byte)'2', (byte)'3', (byte)'4', (byte)'5', (byte)'6', (byte)'7',
(byte)'8', (byte)'9', (byte)'+', (byte)'/'
};

private byte convertBase64CharToByte(byte val)
{
byte value = 0;
if ((val >= (byte)'A' && val <= (byte)'Z')
|| (val >= (byte)'a' && val <= (byte)'z') /* is alphabetic */
|| (val >= 0x30 && val <= 0x39) /* is numeric */
|| (val == (byte)'+') || (val == (byte)'/') /* is special */ )
{
if (val == (byte)'+') value = 62;
else if (val == (byte)'/') value = 63;
else if (val >= (byte)'A' && val <= (byte)'Z')
{
value = val - (byte)'A';
}
else if (val >= (byte)'a' && val <= (byte)'z')
{
value = val - (byte)'a' + 26;
}
else // if (val >= 0x30 && val <= 0x39)
{
value = val - 0x30 + 52;
}
}
else
{
throw new IllegalArgumentException("Value was not a base64 value.");
}

return value;
}
/**
* @param data 4-byte array containing input data.
* @return 3-byte array containing the data.
*/
public static byte[] convertBase64ToStandard(byte[] data)
throws IllegalArgumentException
{
/*
A base64 character set uses characters to represent blocks of 6 bits.
Thus it could look like this:
00111111 00222222 00333333 00444444
This is condensed into standard bit notation as such:
11111122 22223333 33444444
*/
if (data.length != 4) throw new java.lang.IllegalArgumentException("Length should be 4.");
byte[] value = new byte[3];
value[0] = (byte)((convertBase64CharToByte(data[0]) << 2) & 0xff);
byte temp = convertBase64CharToByte(data[1]);
value[0] += (temp >> 4);
value[1] = (byte)((temp << 4) & 0xff);
temp = convertBase64CharToByte(data[2]);
value[1] += (temp >> 4);
value[2] = (byte)((temp << 6) & 0xff);
temp = convertBase64CharToByte(data[3]);
value[2] += temp;

return value;
}

public static byte[] convertStandardToBase64(byte[] data)
throws IllegalArgumentException
{
if (data.length != 3) throw new IllegalArgumentException("Length should be 3.");
byte[] value = new byte[4];
byte temp = (data[0] & 0xfc);
value[0] = Base64EncodingTable[temp >> 2];
temp = (byte)((data[0] & 0x03) << 4);
temp |= ((data[1] & 0xf0) >> 4);
value[1] = Base64EncodingTable[temp];
temp = (byte)((data[1] & 0x0f) << 2);
temp |= ((data[2] & 0xc0) >> 6);
value[2] = Base64EncodingTable[temp];
temp = (byte)(data[2] & 0x3f);
value[3] = Base64EncodingTable[temp];

return value;
}


Then, you read in and write out to streams.  That's sooooo simple, and way more efficient than using a Vector.  3 or 4 in, 3 or 4 out (the opposite of what you took in), and 2 bytes total for temporary use within those methods.
QuoteEvery generation of humans believed it had all the answers it needed, except for a few mysteries they assumed would be solved at any moment. And they all believed their ancestors were simplistic and deluded. What are the odds that you are the first generation of humans who will understand reality?

After 3 years, it's on the horizon.  The new JinxBot, and BN#, the managed Battle.net Client library.

Quote from: chyea on January 16, 2009, 05:05 PM
You've just located global warming.

Dynobird

#7
@MyndFyre

Interesting idea. But for the user who wants to see this file... do you plan on converting the base 64 to hex (or binary)? Or do you plan on leaving it in base 64? The latter wouldn't seem like a solution, as base 64 isn't very easy to read... it's 2^6; you can't easily separate it into bytes. But as for the former question, if you are converting the base 64 to hex, wouldn't that take nearly as much (or more) work as going through the 50 million iterations that translate Objects to bytes would? Like (I hope this math is right) hex has to do 6/4 = 1.5 times more iterations than base 64 (as hex is 2^4, and base 64 = 2^6). So, if hex takes 50 million iterations, base 64 would be 50 x 10^6 x 4/6 = 50 x 10^6 x 2/3 = 33.3 (barred) million iterations. This is less iterations, but each iteration has to go through the work of converting base 64 to binary. So would this be around the same amount (if not more) work? Or perhaps I'm not getting it.

MyndFyre

Okay look.

Base64 represents binary data (data stored in 8-bit bytes) as one of 64 characters.  64 different representations indicates that a Base64-encoded character can represent 6 bits.

Let's say you have the following data:

00111001 10011111 00010000

That's 3 bytes.

This would be separated into 4 Base64 characters.  Base64 uses a subset of ASCII, so you're actually using 8 bits to encode 6 bits of data.  But that's okay.  These are the characters that would be Base64:

00111001 10011111 00010000

So, calculate the values of those:
14 25 60 16
Then into Base64 according to the conversion table:
OZ8Q

At a minimum, data must be encoded into Base64 in chunks of 3 bytes or decoded from Base64 in chunks of 4, or else you have incomplete chunks.  You could, of course, encode chunks of 6 bytes to 8, or decode 8 to 6, but it's the same operation repeated; there's no way to make it more convenient.

To translate every byte in a file, you're going to require 100,000,000 + 99,999,999 + 99,999,998 + ... + 1 iterations if you use a Vector.  Or, sum (i=1 to n) (i) (which is why I gave big-O notation earlier).  Then, you *still* have to go through all the effort of my method to convert to or from Base64.  Then to write it back out, you need to go through n iterations.

So, for a 100-MB file, to convert from Base64 to standard binary (assuming MB is 1 million bytes, to make it easy), you need sum(i=1 to 100,000,000)(i) (to read the file) + 100,000,000 / 4 (to convert the file) + 100,000,000 * 0.75 (to write the file).  Writing the vector is an O(1) operation assuming you remove the first item in the vector as you write it.

Using my method, you read, convert, and write back all in one operation.  To do file-to-file, to convert a 100mb from Base64 to standard binary, you need 25 million reads, each taking the same 9 bytes of memory, then 25 million writes, but all in the same operation, so you only have 25 million iterator comparisons.  Sure, that's a lot of writes, but it's 9 bytes of memory.  A good Java implementation will perform read-ahead and write-ahead caching to increase I/O performance.

Want to read it into a byte array so that it can be displayed as text in some way to the user?  Simply allocate a byte[], then create a ByteArrayMemoryStream and use a stream writer object to write the data to that instead of to a file on disk.  byte[] data = new byte[100000000]; ByteArrayMemoryStream datastream = new ByteArrayMemoryStream(data); done.

I don't know why you asked about hex.  I suppose you could translate Base64 to hex instead, but that's even more wasteful of space.  Two Base64 characters can be translated to 3 hex characters.  But Base64 is traditionally used to encode binary data, such as images, that couldn't be transmitted over e-mail systems (newer systems generally use UUencode, but that's beside the point).

And I certainly don't plan on doing any of this.  It's not my project.
QuoteEvery generation of humans believed it had all the answers it needed, except for a few mysteries they assumed would be solved at any moment. And they all believed their ancestors were simplistic and deluded. What are the odds that you are the first generation of humans who will understand reality?

After 3 years, it's on the horizon.  The new JinxBot, and BN#, the managed Battle.net Client library.

Quote from: chyea on January 16, 2009, 05:05 PM
You've just located global warming.