Column Management in CustomDataFrame

This guide covers how to work with columns in CustomDataFrame, including adding, removing, and managing column types.

Understanding Column Specifications (colspecs)

CustomDataFrame organizes columns into different types based on their purpose:

Input Types (Immutable): Read-only data that shouldn’t be modified after loading. Types: input, data, constants, global_consts
Output Types (Mutable, Written to Files): Data that will be written to output files. Types: output, writedata, writeseries, parameters, globals
Cache Types (Mutable, Temporary): Intermediate calculations that don’t need to be saved. Types: cache, series_cache, parameter_cache, global_cache
Parameter Types (Global Values): Values that are globally valid without an index. Types: parameters, globals, parameter_cache, global_cache
Series Types (Multiple Rows Per Index): Data that may have multiple rows with the same index value. Types: series, writeseries, series_cache

Adding Columns

Method 1: Direct Assignment (Recommended for Quick Operations)

The simplest way to add a column is through direct assignment:

import pandas as pd
from lynguine.assess.data import CustomDataFrame

# Create a dataframe
df = CustomDataFrame(pd.DataFrame({'A': [1, 2, 3]}))

# Add a column using direct assignment (goes to 'cache' by default)
df['B'] = [4, 5, 6]

# You can also use pandas Series
df['C'] = pd.Series([7, 8, 9], index=df.index)

Direct assignment automatically adds the column to the cache colspec type, making it mutable but not saved to output files.

Method 2: add_column() Method (Recommended for Explicit Type Control)

Use add_column() when you need explicit control over the column type:

# Add a column with default type (cache)
df.add_column('new_col', [10, 11, 12])

# Add a column as output type (will be written to files)
df.add_column('result', [13, 14, 15], colspec='output')

# Add a parameter (global value)
df.add_column('threshold', [0.5, 0.5, 0.5], colspec='parameters')

Advantages of add_column():

Explicit type specification
Validation that column doesn’t already exist
Validation that colspec type is valid
Self-documenting code

Removing Columns

Use the drop_column() method to remove columns:

# Drop a single column
df.drop_column('unwanted_col')

# The column is removed from both the data and column specifications

Note: This permanently removes the column. There is no undo.

Common Patterns

Pattern 1: Pre-creating Columns for Compute Operations

Some compute operations expect columns to exist:

# Pre-create columns that compute operations will populate
df['word_count'] = None
df['sentiment_score'] = None

# Then run compute operations that fill these columns
df.compute.run()

Pattern 2: Converting Column Types

To move a column to a different colspec type:

# Save the data
col_data = df['temp_col']

# Drop from current location
df.drop_column('temp_col')

# Re-add with new type
df.add_column('temp_col', col_data, colspec='output')

Pattern 3: Working with Series Data (Multiple Rows Per Index)

For data with multiple rows per index value:

# Create a series with repeated indices
series_data = pd.Series([1, 2, 3, 4], index=['a', 'a', 'b', 'b'])

# This automatically goes to a series-type colspec
df['multi_value_col'] = series_data

Best Practices

Use add_column() for clarity: When the column type matters, use add_column() with an explicit colspec parameter.
Use direct assignment for quick work: For temporary calculations or when working interactively, direct assignment (df['col'] = value) is more concise.
Choose the right colspec type:
- Use input for source data that shouldn’t change
- Use output for results you want to save
- Use cache for intermediate calculations
- Use parameters for global configuration values
Validate before dropping: Check if a column exists before dropping to avoid errors:
```
if 'col_name' in df.columns:
    df.drop_column('col_name')
```

Common Errors and Solutions

Error: Column already exists

# ❌ This raises ValueError
df.add_column('existing_col', [1, 2, 3])

Solution: Use direct assignment to replace values, or drop first:

# ✅ Replace values
df['existing_col'] = [1, 2, 3]

# ✅ Or drop and re-add
df.drop_column('existing_col')
df.add_column('existing_col', [1, 2, 3], colspec='output')

Error: Column not found

# ❌ This raises KeyError
df.drop_column('nonexistent_col')

Solution: Check if column exists first:

# ✅ Check first
if 'nonexistent_col' in df.columns:
    df.drop_column('nonexistent_col')

Error: Invalid colspec

# ❌ This raises ValueError
df.add_column('col', [1, 2, 3], colspec='invalid_type')

Solution: Use one of the valid colspec types listed at the top of this guide.