SEER database analysis toolkit written in Python 3. The tool is command-line driven, with configuration information provided via arguments and config files. It has been tested primarily on *nix-based systems with data stored in MongoDB.
$ ./main.py <arguments>
or
$ python3 main.py <arguments>
The following is an example of a JSON configuration file used for tuning a decision tree. Other example JSON files are in the /example folder of this git repo.
{
"dataSource": {
"mongoDb": {
"ip": "localhost",
"port": "27017"
},
"targetName" : "IdValue",
"data" : {
"collectionName1" : [
{"fieldA" : "Values"},
{"fieldB" : "Values, Values"}
],
"collectionName2" : [
{"fieldA" : "values"}
]
}
},
"decisionTree": {
"maxTreeDepth": 3,
"maxFeatures": 2
},
"output": {
"saveJson": 1
}
}

Contains one of the suboptions listed below, to select where the data will come from.
| Name | Value | Description |
|---|---|---|
| targetName | str | Name of target feature loaded from data |
| data Source | json Element | Element with configuration info for loading data ('mongoDb', 'csvFile', etc.) |
Pull data from MongoDB.
| Name | Value | Description |
|---|---|---|
| ip | str | IP address of the host where MongoDB is running |
| port | int | Port number of the MongoDB server |
| database | str | Name of the MongoDB database |
| data | json Element | Element listing collection & field names |
Pull data from a CSV file.
| Name | Value | Description |
|---|---|---|
| filePath | str | Path to CSV file |
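
As a rough illustration of how "one of the suboptions" could be enforced, the sketch below checks that a `dataSource` element contains exactly one known source. The function name `select_data_source`, the nesting of `filePath` under `csvFile`, and the path `data/example.csv` are assumptions made for this example and are not taken from the toolkit's code.

```python
import json

# Suboption names taken from the tables above; the exact spelling the
# toolkit expects for the CSV option ("csvFile") is an assumption.
KNOWN_SOURCES = ("mongoDb", "csvFile")


def select_data_source(config):
    """Return (source_name, source_settings) for the single configured source."""
    data_source = config["dataSource"]
    present = [name for name in KNOWN_SOURCES if name in data_source]
    if len(present) != 1:
        raise ValueError(
            "dataSource must contain exactly one of %s, found %s"
            % (", ".join(KNOWN_SOURCES), present or "none")
        )
    name = present[0]
    return name, data_source[name]


# Hypothetical configuration using the CSV source.
config = json.loads("""
{
  "dataSource": {
    "csvFile": {"filePath": "data/example.csv"},
    "targetName": "IdValue"
  }
}
""")

name, settings = select_data_source(config)
print(name, settings["filePath"])
```

Failing early on a missing or ambiguous source keeps the error close to the configuration mistake rather than surfacing later during data loading.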
| Name | Value | Description |
|---|---|---|
| split | int | Percentage of data to allocate for testing |
| randomSeed | int | Provide fixed seed value for deterministic results |
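
To make the effect of `split` and `randomSeed` concrete, here is a minimal sketch using only the standard library. It assumes `split` is the percentage of rows held out for testing, as the table states; the helper `split_rows` is hypothetical, and the toolkit itself may rely on sklearn for this step.

```python
import random


def split_rows(rows, split, random_seed=None):
    """Split rows into (train, test), with `split` percent going to test."""
    if not 0 <= split <= 100:
        raise ValueError("split must be a percentage between 0 and 100")
    shuffled = list(rows)
    # A dedicated Random instance keeps the global random state untouched.
    random.Random(random_seed).shuffle(shuffled)
    test_size = round(len(shuffled) * split / 100)
    return shuffled[test_size:], shuffled[:test_size]


train, test = split_rows(range(10), split=30, random_seed=42)
print(len(train), len(test))  # 7 3
```

Calling the function twice with the same `random_seed` yields the same partition, which is the deterministic behaviour the option is meant to provide.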
The next element selects a computation to run on the data loaded from the data source.
Arguments depend on the algorithm being used. When gridSearch is enabled, all decision tree (DT) arguments listed as accepting arrays must be given as array values, even if they hold a single value.
| Name | Value | Description |
|---|---|---|
| maxTreeDepth | int or array | Tree depth |
| maxFeatures | int or array | Maximum number of features to be used |
| minSplitNum | int or array | Minimum split value |
| randomSeed | int | Provide fixed seed value for deterministic results |
| gridSearch | int or bool | When set to '1' or 'True', enables grid search with the provided array values |
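
The array requirement makes sense once the arguments are viewed as axes of a search grid. The sketch below expands array-valued arguments into every combination using `itertools.product`. It only illustrates the idea: the helper `expand_grid` is hypothetical, and how the toolkit actually performs the search is not shown here.

```python
from itertools import product

# Arguments that the table above marks as accepting arrays.
GRID_KEYS = ("maxTreeDepth", "maxFeatures", "minSplitNum")


def expand_grid(dt_config):
    """Return one parameter dict per combination of the array-valued arguments."""
    keys = [k for k in GRID_KEYS if k in dt_config]
    if not dt_config.get("gridSearch"):
        # Without grid search, the single provided values are used as-is.
        return [{k: dt_config[k] for k in keys}]
    for k in keys:
        if not isinstance(dt_config[k], list):
            raise TypeError("%s must be an array when gridSearch is enabled" % k)
    value_lists = [dt_config[k] for k in keys]
    return [dict(zip(keys, combo)) for combo in product(*value_lists)]


decision_tree = {
    "maxTreeDepth": [3, 5],
    "maxFeatures": [2],
    "minSplitNum": [2, 10],
    "gridSearch": 1,
}
grid = expand_grid(decision_tree)
for params in grid:
    print(params)
```

With two depths, one feature limit, and two split values, the grid contains four candidate parameter sets. A singular value such as `maxFeatures` still has to be wrapped in an array so that it can take part in the product.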
| Name | Value | Description |
|---|---|---|
| predictors | array | which predictors are to be used, more info below |
This is a JSON-style array in which each element maps a predictor name to its datatype. The name should match the field name loaded from the data source (i.e. the column name if the data source is a CSV file). The value indicates the datatype: 'l' for a linear numeric value, 'c' for a categorical value.
"predictors" : [
{"height" : "l"},
{"isHuman" : "c"}
]

Sets the location and parameters for where the resulting output files should be saved.
| Name | Value | Description |
|---|---|---|
| saveJson | 0,1 | Save the JSON configuration file along with the output data |
| directory | text | Directory location to save output data |
| timestamp | 0,1 | Append timestamp to output directory name |
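
One plausible way the three output options could interact is sketched below. The helper `prepare_output`, the default directory name `output`, the saved file name `config.json`, and the timestamp format are all assumptions made for illustration and may differ from what the toolkit does.

```python
import json
import os
import tempfile
from datetime import datetime


def prepare_output(config, now=None):
    """Create the output directory and optionally save the config into it."""
    output = config.get("output", {})
    directory = output.get("directory", "output")  # default name is an assumption
    if output.get("timestamp"):
        now = now or datetime.now()
        # Timestamp format is an assumption chosen to sort chronologically.
        directory = directory.rstrip("/\\") + "_" + now.strftime("%Y%m%d_%H%M%S")
    os.makedirs(directory, exist_ok=True)
    if output.get("saveJson"):
        with open(os.path.join(directory, "config.json"), "w") as handle:
            json.dump(config, handle, indent=2)
    return directory


# Use a temporary location so the example does not touch the working directory.
base = os.path.join(tempfile.mkdtemp(), "results")
config = {"output": {"directory": base, "timestamp": 1, "saveJson": 1}}
out_dir = prepare_output(config, now=datetime(2020, 1, 2, 3, 4, 5))
print(out_dir)
```

Saving the configuration next to the results makes each run reproducible from its output directory alone, and the timestamp keeps repeated runs from overwriting one another.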
The following is a list of dependencies used within the compute package:
- sklearn
- numpy
- pandas
- Move parser from JSON to YAML
- Dynamic loading of classes
- UI frontend